Usually when people release open source software, the documentation is lacking, there's no website, and so on. These guys absolutely nail it every single time.
Kudos to them, really!
This is very cool. Integrating with a name resolution protocol that every existing programmer and stack knows how to use (often without even thinking about it) should lead to some magical "just works" moments.
In common with Consul:
* DNS interface
* Operates as a distributed cluster
* Uses Raft for consensus
It seems like the right thing to do here would be to take the lessons of building Consul and turn Serf into something more like a library on which to build other things, rather than a service in its own right.
And while you may not see Serf as having much use, we've personally helped and seen Serf clusters with many thousands of nodes. Serf is very useful to these organizations for its purpose. And while some of these orgs are now looking at Consul, many don't need Consul in the same way (but may deploy it separately).
We're not stopping with Consul. We have something more on the way. But we now have some great building blocks and experience building distributed systems to keep doing it correctly without having to rebuild everything from scratch.
Consul is a CP system, meaning it trades availability for consistency. It has a much more limited ability to tolerate failures. However, its more central architecture allows it to support a richer feature set.
By keeping the tools separate we give developers and operators two different tools. Sometimes you need a hammer, and sometimes a screwdriver will do.
This page compares the two: http://www.consul.io/intro/vs/serf.html
This coalesces a lot of different ideas into what seems to be a really tight package for solving hard problems. Looking around at what most companies are doing, even startupy types, architectures are becoming more distributed, and a (hopefully) solid tool for discovery and configuration seems like a big step in the right direction.
I was planning to make a tool like this (smaller scale, one machine), and this will certainly serve as a good guide on how to do it right (or whether I should even bother at all).
I can't find a trace of a standard/included slick web interface for managing the clusters and agents -- are they leaving this up to a 3rd party (by just providing the HTTP API and seeing what people will do with it)? Is that a good idea?
If I may ask, it seems like the design of the Consul site is one step (iteration) away from the Serf site (particularly the docs pages -- some subtle changes made a large difference)... I agree with the others here, really dig the site, and big text definitely doesn't hurt deeply technical descriptions. The architecture page was very readable for me.
that said, as i wrote in my blog post on service discovery ( http://nerds.airbnb.com/smartstack-service-discovery-cloud/ ), dns does not make for the greatest interface to service discovery because many apps and libraries cache DNS lookups.
an http interface might be safer, but then you have to build a connector for this into every one of your apps.
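The caching pitfall described above can be sketched in a few lines of Python. This is a toy illustration, not Consul-specific; the service name and addresses are invented:

```python
class CachingResolver:
    """Caches the first answer forever, as many apps and libraries
    effectively do when they ignore DNS TTLs."""

    def __init__(self, dns_table):
        self.dns = dns_table  # stands in for a real DNS server
        self.cache = {}

    def resolve(self, name):
        if name not in self.cache:
            self.cache[name] = self.dns[name]
        return self.cache[name]


dns = {"memcached.service.example": "10.0.0.1"}
resolver = CachingResolver(dns)
print(resolver.resolve("memcached.service.example"))  # 10.0.0.1

# memcached fails over to a new node and DNS is updated...
dns["memcached.service.example"] = "10.0.0.2"

# ...but the caching client keeps talking to the dead address.
print(resolver.resolve("memcached.service.example"))  # still 10.0.0.1
```

A resolver that honored TTLs (or re-resolved per request) would pick up the new address; the point is that many stacks don't.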
i still feel that smartstack is a better approach because it is transparent. haproxy also provides us with great introspection for what's happening in the infrastructure -- who is talking to whom. we can analyse this both in our logs via logstash and in real-time using datadog's haproxy monitoring integration, and it's been invaluable.
however, this definitely deserves a look if you're interested in, for instance, load-balancing UDP traffic
How much time did it take to put this together?
From the service definition[0] it looks like the IP is always the IP of the node hosting the `/etc/consul.d/*` files. I am thinking about a scenario where each service (running in a container) gets an IP address on a private network which is not the IP of the node.
[0]: http://www.consul.io/docs/agent/services.html
Update: An external service is possible: http://www.consul.io/docs/guides/external.html
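For reference, a service definition is just a small JSON file dropped into /etc/consul.d. A minimal sketch of what such a file contains (the service name, tag, and port here are invented; see the linked docs for the full schema):

```python
import json

# Contents of a hypothetical /etc/consul.d/web.json service definition.
definition = {
    "service": {
        "name": "web",
        "tags": ["rails"],
        "port": 80,
    }
}

# Render it as the JSON the agent would read from disk.
print(json.dumps(definition, indent=2))
```

Note there is no address field here: as observed above, the agent registers the service under its own node's IP, which is what makes the container-with-private-IP case awkward.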
Discovery: The Consul page claims to provide a DNS-compatible alternative for peer discovery, but it is unclear what improvements it offers other than 'health checks', and the documentation leaves failure-resolution processes unspecified (as far as I can see), thus mandating a hyper-simplistic architecture strategy: run lots of redundant instances in case one fails. That's not very efficient. (It might be interesting to note that at the Ethernet level, IP addresses also provide MAC address discovery. If you are serious about latency, floating IP ownership is generally far faster than other solutions.)
Configuration: We already have many configuration management systems, with many problems[1]. This is just a key/value store, and as such is not as immediately portable to arbitrary services as existing approaches such as "bunch-of-files", instead requiring overhead for each service launched in order to make it work with this configuration model.
The use of the newer raft consensus algorithm is interesting, but consensus does not a high availability cluster make. You also need elements like formal inter-service dependency definition in order to have any hope of automatically managing cluster state transitions required to recover from failures in non-trivial topologies. Corosync/Pacemaker has this, Consul doesn't. Then there's the potential split-brain issues resulting from non-redundant communications paths... raft doesn't tackle this, as it's an algorithm only. Simply put: given five nodes, one of which fails normally, if the remaining four split in equal halves who is the legitimate ruler? Game of thrones.
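For what it's worth, majority quorum does answer the Game of Thrones question: neither half rules. A Raft leader needs votes from a majority of the configured cluster size, so a 2/2 split simply stalls until the partition heals (no split-brain, but also no availability). The arithmetic:

```python
# Majority-quorum arithmetic behind the "five nodes, one fails, 2/2 split"
# scenario: a leader needs a majority of the CONFIGURED cluster size.

def quorum(n):
    """Smallest majority of an n-node cluster."""
    return n // 2 + 1


cluster = 5
print(quorum(cluster))  # 3: three of five servers must agree

# One node fails normally; four survive, but quorum is still 3 of the
# configured 5. A 2/2 partition leaves neither side with a majority:
left, right = 2, 2
print(left >= quorum(cluster), right >= quorum(cluster))  # False False
```

The cluster refusing to elect anyone is the safe (consistent) outcome; whether unavailability is acceptable is exactly the CP trade-off discussed elsewhere in the thread.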
As peterwwillis pointed out, for web-oriented cases, the same degree of architectural flexibility and failure detection proposed under consul can be achieved with significantly reduced complexity using traditional means like a frontend proxy. For other services or people wanting serious HA clustering, I would suggest looking elsewhere for the moment.
They do this using a gossip-based protocol and a consensus algorithm called Raft (designed as a more understandable alternative to Paxos). These two things work together so that the servers running your various services (whether api or db or cache or whatever) know about EACH OTHER.
The database they use is LMDB, but I think they chose that for lightness -- you could easily replace it with a local instance of cassandra, most likely.
Also, I'm assuming you don't mean switching to a centralized cassandra instance -- why you don't want to do that should be obvious (central point of failure).
I've never had a cluster completely collapse on me unless things were already screwed up enough that Service Discovery was ultimately useless since nothing else would work.
It just seems to me that losing your datastore makes your services unusable...at which point 'discovering them' isn't really the issue. Instead, everyone wants to introduce another datastore you need to rely on that its loss == can't find anyone. Even if your services themselves are still functional.
"... However, Serf does not provide any high-level features such as service discovery..."
Hm...
If your application relies on memcached, you need to pass the memcached location to your application somehow. For simple architectures, this may just be a hardcoded localhost:11211.
As you scale, it becomes prudent to distribute services across different servers. Your configuration could then become something like "server1.mycompany.com:11211". But what if memcached moves from server1 to server2? You'll need to reconfigure and restart your application.
More sophisticated apps will often use a dynamic approach: services are registered with something like ZooKeeper or etcd. When serviceA needs to talk to serviceB, serviceA looks up serviceB's address in the service registry (or a local cache) and makes the request.
The good news is that these often include basic health check functionality, so you get a bit of fault tolerance for free. Unfortunately, this requires services to integrate directly with ZooKeeper or etcd, adding undesired complexity.
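The lookup pattern described above can be sketched as a toy: an in-process dict stands in for the ZooKeeper/etcd registry, and the lookup filters out instances that failed their health check (addresses and service names invented):

```python
# Toy service registry: each service maps to a list of instances with a
# health flag maintained by some external health-check mechanism.
registry = {
    "serviceB": [
        {"addr": "10.0.0.5:8080", "healthy": True},
        {"addr": "10.0.0.6:8080", "healthy": False},  # failed health check
    ]
}


def lookup(service):
    """Return addresses of healthy instances only."""
    return [e["addr"] for e in registry[service] if e["healthy"]]


# serviceA resolving serviceB before making a request:
print(lookup("serviceB"))  # ['10.0.0.5:8080']
```

The "undesired complexity" point is that every client has to carry this lookup logic (plus a registry client library and caching), rather than just opening a socket to a name.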
Some architectures therefore choose to use DNS as their service registry. But instead of hardcoding the DNS name of a single node (like "server1.mycompany.com"), they hit an address associated with the service (serviceB.mycompany.com). This usually means rolling your own system to keep DNS up to date (adding/pruning entries based on health state).
Consul is a hybrid approach. It allows you to use DNS as a service registry, but operates as its own, distributed DNS server cluster. Think of it like a specialized ZooKeeper cluster that exposes service information via DNS (and HTTP, if you prefer).
Back to the memcached case. With Consul, you'd point your app at "memcached.service.consul:11211". If your memcached server fell over and was replaced, Consul would pick up the change and return the new address, without any app config changes or restarts.
From what I can tell, Consul supports two registration mechanisms: statically defined services in /etc/consul.d, and dynamically defined services through the HTTP API.
For the statically-defined case, for any given node, you have to create Puppet (or Chef, or whatever) definitions that populate /etc/consul.d with the stuff that's going to run on that node. For the actual configuration itself, you still want Puppet to be the one to populate it. The question then is what you gain by doing this; if that configuration goes into Puppet, then Puppet is still the main truth where you want to centralize things, so then you have this flow of data:
client <- DNS <- Consul <- /etc/consul.d <- Puppet
...compared to the "old" way: client <- /srv/myapp/myapp.conf-or-whatever <- Puppet
In this case, Consul's benefit comes from the fact that it knows which services are alive and which aren't, so that when myapp needs otherapp, it doesn't need a load balancer to figure that out. The documentation makes a point about Puppet updates being slow and unsynchronized, and it's true that you get into situations where, for example, service A is configured with hosts that aren't up yet. With Consul you can update the config "live"; surely you want to centralize config in Puppet and populate Consul's K/V from Puppet, and then you get the single-point-of-update synchronization missing from Puppet, but you still need to store the truth in Puppet.
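Populating Consul's K/V from Puppet boils down to HTTP PUTs against the K/V endpoint (PUT /v1/kv/&lt;key&gt;). A minimal Python sketch; the key and value are invented, and it assumes a local agent on the default HTTP port 8500:

```python
import urllib.request

# Build (but don't send) a PUT to Consul's K/V HTTP API. A Puppet exec
# or report processor could issue one of these per config key; the caller
# sends the request when an agent is actually running.

def kv_put_request(key, value, host="localhost", port=8500):
    """Construct the PUT request for one K/V entry."""
    url = f"http://{host}:{port}/v1/kv/{key}"
    return urllib.request.Request(url, data=value.encode(), method="PUT")


req = kv_put_request("myapp/db/host", "db1.internal.example")
print(req.get_method(), req.full_url)
# PUT http://localhost:8500/v1/kv/myapp/db/host
```

To actually send it you would pass `req` to `urllib.request.urlopen` against a running agent; the point is only that "Puppet as the source of truth, Consul as the live distribution layer" is a few HTTP calls, not a new toolchain.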
So I'm counting two good, but not altogether mind-blowing benefits from using Consul with Puppet, over not using Consul at all. The overlap is looking a lot like two systems vaguely competing for dominance.
I suspect the better use of Consul is in conjunction with something like Docker, where you ditch Puppet altogether (except as a way to update the host OS), and instead build images of apps and services that don't contain any configuration at all, but simply point themselves at Consul. That means that when you bring up a new Docker container, it can start its Consul agent, register its services, and suddenly its contained services are dynamically available to the whole cluster.
The container itself contains no config, no context, just general-purpose application/service code; and Consul doesn't need to be populated through Puppet because in that way, Consul is (in conjunction with some container provisioning system) the application world's Puppet.
That, to me, sounds pretty nice.
What I will say, in my usually derisive fashion, is I can't tell why the majority of businesses would need decentralized network services like this. If you own your network, and you own all the resources in your network, and you control how they operate, I can't think of a good reason you would need services like this, other than a generalized want for dynamic scaling of a service provider (which doesn't really work without your application being designed for it, or an intermediary/backend application designed for it).
Load balancing an increase of requests by incrementally adding resources is what most people want when they say they want to scale. You don't need decentralized services to provide this. What do decentralized services provide, then? "Resilience". In the face of a random failure of a node or service, another one can take its place. Which is also accomplished with either network or application central load balancing. What you don't get [inherently] from decentralized services is load balancing; sending new requests to some poor additional peer simply swamps it. To distribute the load amongst all the available nodes, now you need a DHT or similar, and take a slight penalty from the efficiency of the algorithm's misses/hits.
All the features that tools like this provide - a replicated key/value store, health checks, auto discovery, network event triggers, service discovery, etc - can all be found in tools that work based on centralized services, while remaining scalable. I guess my point is, before you run off to your boss waving an iPad with Consul's website on it demanding to implement this new technology, try to see if you need it, or if you just think it's really cool.
It's also kind of scary that the ability of an entire network like Consul's to function depends on minimum numbers of nodes, quorums, leaders, etc. If you believe the claims that the distributed network is inherently more robust than a centralized one, you might not build it with fault-tolerant hardware or monitor them adequately, resulting in a wild goose chase where you try to determine if your app failures are due to the app server, the network, or one piece of hardware that the network is randomly hopping between. Could a bad switch port cause a leader to provide false consensus in the network? Could the writes on one node basically never propagate to its peers due to similar issues? How could you tell where the failure was if no health checks show red flags? And is there logging of the inconsistent data/states?
I want to clarify: Of all the buzzwords Consul has, one thing Consul ISN'T is decentralized. You must run at least one Consul server in a cluster. If you want a fully centralized approach, you can just run one server. No big deal. Of course, if that server goes down, reads/writes are unavailable. If you want high availability, you run multiple servers. They elect a leader to determine who will handle the writes, but that is about it.
It is "decentralized" in that you can send read/writes to any server, but those servers actually just forward the requests onto the leader.
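The "one server works, but more gives you high availability" trade-off above comes down to majority arithmetic: n servers need n // 2 + 1 for quorum, so they tolerate n - (n // 2 + 1) failures. A quick sketch:

```python
# Failure tolerance of a majority-quorum server cluster, which is why a
# single server gives zero fault tolerance and odd sizes like 3 or 5 are
# the usual recommendation.

for n in (1, 3, 5):
    quorum = n // 2 + 1
    tolerated = n - quorum
    print(f"{n} server(s): quorum {quorum}, tolerates {tolerated} failure(s)")

# 1 server(s): quorum 1, tolerates 0 failure(s)
# 3 server(s): quorum 2, tolerates 1 failure(s)
# 5 server(s): quorum 3, tolerates 2 failure(s)
```

Even sizes buy nothing: 4 servers still need a quorum of 3 and so tolerate only 1 failure, same as 3 servers, which is why the docs talk in terms of three or five.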
Now that i've re-read your architecture page, let me see if I understand this: the basic point behind using Consul is to have multiple servers agree on the result of a request, and communicate that agreement to a single node to write it, and then return it to the client. So really it's a fault-tolerant messaging platform that includes features that take advantage of such a network; do I have that right?
Also, your docs say there are between three and five servers, but here you're saying you only need one?