Distributed Web Systems with Consul, Diplomat, Envoy and HAProxy
As part of my never-ending quest to improve how I build cool things, I’ve been working for some time on building out infrastructure to help automate and monitor how my apps and servers are doing. I’ve written about horizontal scaling before, but today I’d like to get into one specific facet of its implementation: automated network discovery, and how we use it at FarmGeek to build reliable applications.
The Problem
So let’s say you have a few servers - a load balancer, two application servers and a database server, for example. Everything’s working fine until BAM, one of your application servers crashes. To make things worse, in this scenario nobody finds out about it. Your HAProxy checks do their job, however, and so the node leaves the connection pool as expected.
Your server capacity just silently halved, without any notification and with no way of recovering from the problem. That’s not good.
There are a bunch of problems with the “standard” setup being described here:
- There’s no way of knowing what resources are available among the servers currently switched on - every server suffers from “Network Blindness”.
- HAProxy’s checks fail silently.
- There’s no way of handling IP changes or new servers without manually editing HAProxy’s config.
Using Consul, and with some help from Diplomat and Envoy, we aim to fix all three of these issues.
Introducing Consul
The first problem on this list can be solved with the help of a handy little idea known as Automated Service Discovery. One such implementation is Consul by the lovely fellows at HashiCorp, which is our weapon of choice at FarmGeek.
There are three core things Consul does which help us:
- It provides a distributed Key-Value store which allows us to persist configuration data across a network, thus allowing our services to become more portable and easier to run in parallel - as they can share configuration data between each other without relying on a datastore being present.
- It provides a DNS service for services on the network which allows our servers to become more “Network Aware” with almost zero extra work. The DNS service also doubles as a simple Load Balancer.
- It provides health checks against those services, and will remove them from the DNS pool if they begin to fail.
Of course, Consul does a heap of other things for us, but we’ll focus on these three today as they’re the most relevant to solving our problem.
I’m not going to go over installing Consul here, as there’s a brilliant tutorial on Consul.io, but I will explain services, as they’re the key to how we achieve a fully distributed system.
A Service is defined in Consul with (you guessed it) a Service Definition. A Service Definition outlines what kind of service we’re describing, which port it’s on, and what we have to run to check its health. I recommend at least running service checks on the database and the application instances. You can check the service however you want (bash script, Ruby script, etc.) - the main stipulation is that the check exits with a non-zero status for less-than-perfect results. This allows Consul to decide whether a service is unhealthy, which in turn allows Consul to remove dodgy services from the pool of connections.
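As a sketch, a service definition for the database might look something like this - the service name, port and check script path here are hypothetical, and the file would be dropped into the agent’s configuration directory (e.g. /etc/consul.d/):

```json
{
  "service": {
    "name": "postgres",
    "port": 5432,
    "check": {
      "script": "/usr/local/bin/check_postgres.sh",
      "interval": "10s"
    }
  }
}
```

The check script can be anything executable: exit with 0 when everything looks healthy, and with a non-zero status for anything less.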
Another important point is how Consul’s DNS API works. Yes - Consul has a DNS API. The way it works is simple: it provides you with the IP of a random healthy node if you send it a specially crafted domain to resolve. It can even give you a more detailed answer, including the port, if you query the SRV record type. Very cool. But the question is, how do you get your app (or any tool, for that matter) to send DNS requests to Consul? At FarmGeek, we’re using dnsmasq to achieve this. All you need to do is install Consul using their guide, install dnsmasq, and then create a /etc/dnsmasq.d/10-consul file with the following contents:
server=/consul/127.0.0.1#8600
Restart dnsmasq and you’ll be able to resolve Consul’s *.consul domains without breaking your regular DNS resolution. Simple!
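To sanity-check the setup, you can query the agent directly. This assumes a Consul agent running locally on its default DNS port (8600), with a hypothetical service named postgres registered:

```
# Ask Consul's DNS interface for the address of a healthy "postgres" node
dig @127.0.0.1 -p 8600 postgres.service.consul A

# The SRV record type also includes the port the service listens on
dig @127.0.0.1 -p 8600 postgres.service.consul SRV
```

Once dnsmasq is in place, a plain `dig postgres.service.consul` (with no server or port flags) should return the same answer.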
Introducing Diplomat
Consul allows our servers to talk to one another and to check on the services on our servers, but how do our apps talk to Consul? Consul has a DNS and an HTTP API for us to use, and Diplomat is a lightweight Ruby wrapper for the HTTP API. At FarmGeek, we use it to store basic configuration data amongst our servers that we’d traditionally provide within Environment Variables.
To use Diplomat, simply add it to your Gemfile, then call Diplomat’s static methods anywhere you’d like to get or set key-value data.
An example use-case would be to configure Rails’ database connection. The example used in the README looks like this:
<% if Rails.env.production? %>
production:
  adapter: postgresql
  encoding: unicode
  host: <%= Diplomat::Service.get('postgres').Address %>
  database: <%= Diplomat.get('project/db/name') %>
  pool: 5
  username: <%= Diplomat.get('project/db/user') %>
  password: <%= Diplomat.get('project/db/pass') %>
  port: <%= Diplomat::Service.get('postgres').ServicePort %>
<% end %>
However, since we have DNS resolution working now, we could have Consul balance our database connections by setting the host to postgres.service.consul. If we have more than one postgres service available on the network, we’ll be randomly switched between them automatically.
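For example, the host and port lines of the hypothetical database.yml above could become:

```
# Consul's DNS picks a healthy postgres node for us on each lookup
host: postgres.service.consul
port: <%= Diplomat::Service.get('postgres').ServicePort %>
```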
Introducing Envoy
NB: Envoy is now unsupported, as it has been superseded by consul-template. Use that instead!
At this point our servers are aware of one another, our services are aware of one another, and our apps are able to share configuration. The final step is to connect our apps to our services. Usually this is straightforward. In the case of HAProxy, however, it’s a bit more tricky.
So we came up with Envoy, a really simple NodeJS script FarmGeek has released on GitHub under the MIT license to connect HAProxy to Consul. It’s designed to be very hackable and lightweight, and it should run on each HAProxy server. Envoy will reload your config simply by calling `service haproxy reload`, so it may require sudo.
To use Envoy, clone the repository onto your server, add in an HAProxy template based on the sample one in the repository, and run it (preferably as a service). Envoy will periodically poll Consul for changes, and if it finds any, it’ll regenerate your HAProxy config and reload it. Simple! I’ve outlined an example configuration to serve as a way of explaining what Envoy does:
global
    log 127.0.0.1 local0
    log 127.0.0.1 local1 notice
    chroot /var/lib/haproxy
    daemon
    maxconn 4096
    stats timeout 30s
    stats socket /tmp/haproxy.status.sock mode 660 level admin
    user haproxy
    group haproxy
    # Default ciphers to use on SSL-enabled listening sockets.
    # For more information, see ciphers(1SSL).
    ssl-default-bind-ciphers RC4-SHA:AES128-SHA:AES256-SHA

defaults
    log global
    mode http
    option httplog
    option dontlognull
    option redispatch
    retries 3
    maxconn 2000
    timeout connect 5000ms
    timeout client 50000ms
    timeout server 50000ms
    errorfile 400 /etc/haproxy/errors/400.http
    errorfile 403 /etc/haproxy/errors/403.http
    errorfile 408 /etc/haproxy/errors/408.http
    errorfile 500 /etc/haproxy/errors/500.http
    errorfile 502 /etc/haproxy/errors/502.http
    errorfile 503 /etc/haproxy/errors/503.http
    errorfile 504 /etc/haproxy/errors/504.http

listen stats :1234
    mode http
    stats enable
    stats uri /
    stats refresh 2s
    stats realm Haproxy\ Stats
    stats auth username:password

frontend incoming
    bind *:80
    reqadd X-Forwarded-Proto:\ http
    mode http
    acl api hdr_dom(host) -i api.farmer.io
    acl web hdr_dom(host) -i farmer.io
    <% if (services.indexOf('api') > -1) { %>
    use_backend api if api
    <% } %>
    <% if (services.indexOf('web') > -1) { %>
    use_backend web if web
    <% } %>

frontend incoming_ssl
    bind *:443 ssl crt /etc/ssl/ssl_certification.crt no-sslv3 ciphers RC4-SHA:AES128-SHA:AES256-SHA
    reqadd X-Forwarded-Proto:\ https
    mode http
    acl api hdr_dom(host) -i api.farmer.io
    acl web hdr_dom(host) -i farmer.io
    <% if (services.indexOf('api') > -1) { %>
    use_backend api if api
    <% } %>
    <% if (services.indexOf('web') > -1) { %>
    use_backend web if web
    <% } %>

<% services.forEach(function(service) { %>
backend <%= service %>
    # Redirect to https if it's available
    redirect scheme https if !{ ssl_fc }
    # Data is proxied in http mode (not tcp mode)
    mode http
    <% backends[service].forEach(function(node) { %>
    server <%= node['node'] + ' ' + node['ip'] + ':' + node['port'] %>
    <% }); %>
<% }); %>
I won’t go over how HAProxy works, as there are plenty of guides on the internet for that, but let’s dive into the areas which aren’t “standard” compared to most configs:
frontend incoming
    bind *:80
    reqadd X-Forwarded-Proto:\ http
    mode http
    acl api hdr_dom(host) -i api.farmer.io
    acl web hdr_dom(host) -i farmer.io
    <% if (services.indexOf('api') > -1) { %>
    use_backend api if api
    <% } %>
    <% if (services.indexOf('web') > -1) { %>
    use_backend web if web
    <% } %>
Line 5 - acl api hdr_dom(host) -i api.farmer.io - is using HAProxy’s access control list system to set the “api” flag if the incoming traffic is requesting the hostname api.farmer.io. In line 8, we then use that flag to decide whether to route to the api backend or not. However, we must also check that Consul actually has a backend of the same name, and so in line 7 we check that Consul has a matching backend before we try to use it.
<% services.forEach(function(service) { %>
backend <%= service %>
    # Redirect to https if it's available
    redirect scheme https if !{ ssl_fc }
    # Data is proxied in http mode (not tcp mode)
    mode http
    <% backends[service].forEach(function(node) { %>
    server <%= node['node'] + ' ' + node['ip'] + ':' + node['port'] %>
    <% }); %>
<% }); %>
In this segment, we’re taking all the services that Envoy has found through Consul and emitting them as backend definitions. Part of this includes emitting all of the healthy nodes attached to each service, which can be seen in lines 7-9.
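To make the template’s job concrete, here’s a minimal sketch in plain NodeJS of the output that backend loop produces. The services array and backends object are hypothetical stand-ins for the data Envoy gathers from Consul:

```javascript
// Hypothetical data in the shape the template expects: a list of service
// names, and for each service the healthy nodes Consul reported.
var services = ['api', 'web'];
var backends = {
  api: [{ node: 'app1', ip: '10.0.0.11', port: 8080 }],
  web: [{ node: 'app2', ip: '10.0.0.12', port: 8081 },
        { node: 'app3', ip: '10.0.0.13', port: 8081 }]
};

// Mirrors the nested forEach in the template: one backend block per
// service, with one server line per healthy node.
function renderBackends(services, backends) {
  return services.map(function (service) {
    var lines = ['backend ' + service, '    mode http'];
    backends[service].forEach(function (node) {
      lines.push('    server ' + node.node + ' ' + node.ip + ':' + node.port);
    });
    return lines.join('\n');
  }).join('\n\n');
}

console.log(renderBackends(services, backends));
```

When a node fails its health check, Consul drops it from the service, Envoy notices on its next poll, and the corresponding server line simply disappears from the regenerated config.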
When all is said and done
Now that we’ve connected up our systems, we’ve made a great stride towards building a more fault-tolerant system.
Next steps are:
- Building an orchestration tool which can cut costs by powering servers up or down depending on load and health.
- Building a notification tool which alerts the admin when something acts oddly (perhaps signalling a bug).
- Handling distributed storage, which also needs to be addressed.