Consensus Is Harder Than It Looks (2020) (opens in new tab)

(brooker.co.za)

97 pointskeegancsmith5y ago31 comments

31 comments

Cache invalidation, one of the so-called hardest things in software, is simply a consensus problem on the lifetime of data. Naming things is consensus on definitions.

Consensus isn’t just hard, it’s the hardest thing. Possibly even the only hard thing, if you include the social aspects of software development too (which involve human consensus).

hinkley5y ago

Author spends some time on determinism, which got me thinking of CRDTs, which are trying to make state changes commutative. They are in effect trying to get from linearalized state changes to causality. If we find a way to solve that problem, we have a more efficient consensus than Raft, one that can withstand small gaps in availability.

I keep waiting for someone to bring a Turing, Gödel, or Shannon into this and point out it’s not computable.

That worry aside, this thought process also brought me to monoids, which are used to share state in some functional languages. I’m curious how much information about concurrent state change is locked up in that space that people trying to solve the general problem don’t have ready access to.

mjb5y ago

Are you familiar with Joe Hellerstein and Peter Alvaro's work in this space? CALM provides something of a unifying theory of mergeable operations, of which CRDTs are an "object orientated" special case: https://arxiv.org/pdf/1901.01930.pdf

In practice, designers choose coordination-heavy protocols like consensus for a number of reasons. One is because writes don't or can't be merged. Operations as basic as simple assignment (x = 1;) can't be merged, so that's very real. Another is because readers can't tolerate weak consistency, because their business logic needs to make decisions at a particular point in time.

You're right that the thinking behind CRDTs (and CALM) is useful in reasoning through determinism in this context. The determinism problem, though, is easier than the general monotonicity problem, because only determinism is required and not associativity or commutativity.

_vvhw5y ago

"we have a more efficient consensus than Raft,"

Except that in the Multi-Paxos family of consensus and replication protocols, Raft is probably the least efficient. It's to the extreme and foregoes several simple optimizations, because of design decisions taken in the name of understandability and simplicity.

I would suggest that VR is at least as understandable as Raft, while also being more efficient and with the lowest latency for leader election in the common case. There's no need for Raft's carefully tuned random timeouts because split votes are not possible with VR. VR also lets you do some really cool things like warm up the next leader, or prioritize synchronous replication under Flexible Paxos to the next leader with asynchronous replication amongst the remaining followers, or decide who you want the next leader to be to optimize for geographical placement etc.

VR's view change protocol for leader election is also entirely in-memory so it's far more fault tolerant compared to Raft if you have a storage fault model and not only a network fault model. For example, Raft requires strong persistence guarantees for correctness of the leader election phase. If anything goes wrong with your disk (a ghost write or misdirected write) then Raft's implementation as written would be unsafe. Raft also has liveness issues if all nodes have even a single block failure at any point in their log.

If you're going to reach for a consensus algorithm, there are a lot of good reasons to do a survey of the literature first. There's a whole spectrum to choose from.

mjb5y ago

Joran, good to see your name come up!

> VR's view change protocol for leader election is also entirely in-memory so it's far more fault tolerant compared to Raft if you have a storage fault model and not only a network fault model.

Only in the fail-stop model, right? Or does this property extend to other models (like omissions)?

1 more reply

danbruc5y ago

If we find a way to solve that problem [...]

We will not. And you are hinting at the reason yourself.

[...] CRDTs, which are trying to make state changes commutative.

But this is slightly wrong, when you are using CRDTs you are not trying to make state changes commutative, you are limiting the allowed state changes to commutative ones. But some state changes are inherently non-commutative and if your system requires those, then you can not build it with CRDTs.

FAANG_dream5y ago

Ohh for sure!

I tried implementing Raft, which is supposed to be the most understandable out of all available consensus algorithms, but wasn't able to make it work 100%.

I came close but in the end gave up on resolving some concurrency issues :/

kilotaras5y ago

I'm not sure if you are aware of it, but check out MIT 6.824 on opencourseware (pick 2020, not 2021).

In Lab 2 they provide you with a frame for raft with RPC, etc. already in place leaving only protocol for you. Lab is also split into 3 logical parts - leader election, append entries, persistence. Highly recommend.

ignoramous5y ago

You are not alone: https://news.ycombinator.com/item?id=23123701 and https://youtu.be/QVvFVwyElLY?t=2502

Like TFA points out, distributed consensus is punishingly hard. AWS relied on TLA+ to prove consensus in DynamoDB and other systems https://lamport.azurewebsites.net/tla/formal-methods-amazon....

Interestingly, Kinesis [0] and SQS (?) avoid consensus for those same reasons.

[0] https://news.ycombinator.com/item?id=25239100

mjb5y ago

Chain Replication (and friends) are vastly simpler than Paxos (and friends) in many ways, but do have the same requirement for determinism. That's because chain replicated systems typically need to be confluent (see https://pathelland.substack.com/p/dont-get-stuck-in-the-con-...), which means that all the replicas need to have the same value in them when replication is done. Conceptually simpler, for sure, but many of the same challenges remain.

ZephyrBlu5y ago

How did you know there were still concurrency issues? I would expect issues like that to be relatively subtle.

Did you build out a test suite?

jerf5y ago

Heh... knowing for certain that you've gotten all the concurrency issues is hard. Knowing that a specific code base has concurrency issues can be very easy, by virtue of them clearly exhibiting bugs. I haven't tried to implement RAFT myself but I've certainly had code bases that clearly had concurrency bugs in them, even if I didn't know exactly what they were. :)

eternalban5y ago

The author mentions in passing "Virtual Synchrony" . It turns out it is used in some highly critical systems via ISIS/VSync framework from Cornel:

The "CORBA Fault Tolerant Objects standard" is based on the virtual synchrony model. Virtual synchrony was also used in developing the New York Stock Exchange fault-tolerance architecture, the French Air Traffic Control System, the US Navy AEGIS system, IBM's Business Process replication architecture for WebSphere and Microsoft's Windows Clustering architecture for Windows Longhorn enterprise servers.

https://en.wikipedia.org/wiki/Virtual_synchrony#Virtual_sync...

https://en.wikipedia.org/wiki/Vsync_(computing)

_vvhw5y ago

I am implementing a Multi-Paxos variant called Viewstamped Replication (http://pmg.csail.mit.edu/papers/vr-revisited.pdf) for TigerBeetle (https://github.com/coilhq/tigerbeetle) and keeping notes along the way to help other implementors down the line.

Some things I'm finding so far:

* As developers, we're used to thinking of services in terms of streaming TCP connections and RPCs. You send a request on a connection and get a response back on the same connection. However, distributed consensus algorithms (or at least their authors) like to think and write in terms of messages and message passing and the classic Actor pattern. For example, it's not uncommon for a consensus client to send a message to a leader but then get the ACK back from another server, a subsequently elected leader. That's at odds with the networking protocol we're used to. It's not always easy to shoehorn a consensus protocol onto a system that already has a TCP oriented design. Embrace message passing and multi-path routing.

* We're familiar with Jepsen. The network fault model is front of mind (dropped/delayed/replayed/corrupted messages, partitions, asymmetrical network topologies and performance). We're far less wary of the storage fault model: latent sector errors (EIO), silent bit rot, misdirected writes (writes written by firmware to the wrong sector), corrupt file system metadata (wrong journal file size, disappearing critical files), kernel page cache coherency issues (marking dirty pages clean after an fsync EIO), confusing journal corruption for a torn write after power failure.

* We underestimate the sheer bulk of the code we need to write to implement all the components of a practical consensus protocol correctly (a consensus replica to run the protocol at each node, a write ahead journal for storage, a message bus for in-process or remote messaging, a state machine for service up calls). The consensus protocol invariants are tough but limited, but the amount of code required to be written for all these components is brutal and there are so many pitfalls along the way. For example, when you read from your write ahead journal at startup and you find a checksum mismatch, do you assume this is because of a torn write after power failure as ZooKeeper and LogCabin do? What if it was actually just bit rot halfway through your log? How would you change your write ahead journal to disentangle these?

* We tend to think of the correctness of any given consensus function as binary, and fail to appreciate the broad spectrum of safety requirements required for specific components of the consensus algorithm. In other words, we don't always take fully to heart that some consensus messages are more critical than others. For example, we might resend an ACK to the leader if we detect (via op number) that we've already logged the prepare for that op number. However, most implementations I've seen neglect to assert and double-check that we really do have exactly what the leader is asking us to persist before we ACK. It's a simple verification check to compare checksums before skipping the journal write and acking the duplicate prepare and yet we don't.

* Another example, when we count messages from peers to establish quorum during leader election, we might count these messages without applying all the assertions we can think of on them. For example, are we asserting that all the messages we're counting are actually for the same leader election term? Or did we simply assume that we reset the array of messages being counted during the appropriate state transition sometime back in the past? The former is a much stronger guarantee, because it keeps you from double-counting stale leader election messages from past election phases, especially if these were successive (e.g. multiple rounds of elections because of split votes with no successful outcome). We should rather assume that the array we store these messages in, and that we're counting, could contain anything, and then assert that it contains exactly what we expect.

* Our intuition around fault tolerance might suggest that local storage faults cannot propagate to destroy global consensus. Yet they do (https://www.youtube.com/watch?v=fDY6Wi0GcPs). We need to be really careful how we repair local faults so that we do so correctly in the context of the global consensus protocol.

* Finally, I think what also really helps is to have a completely deterministic consensus protocol Replica abstraction that you initialize with an abstract Message Bus, Journal and State Machine instance. This Replica instance can send messages to in-process or remote Replica instances, and has on_message() handlers for the various protocol messages that either change state and/or send messages but can never fail (i.e. no error union return type) because that amplifies the dimensionality of the code paths. For timeouts, don't use the system clock because it's not deterministic. Instead, use a Timeout abstraction that you step through by calling tick() on the Replica. With these components in place, you can build an automated random fuzzing test to simulate your distributed network and local storage fault models and test invariants along the way, outputting a deterministic seed to reproduce any random failures easily.

eternalban5y ago

> I am implementing a Multi-Paxos variant called Viewstamped Replication

VR is -not- a variation of Paxos much less the later multi-paxos.

Viewstamped Replication was developed independently from Paxos and is distinct from Paxos. (And it came out a year before Paxos):

From the author of this OP:

https://brooker.co.za/blog/2014/05/19/vr.html

"Introduced in May 1988 in Brian Oki's PhD thesis, Viewstamped Replication predates the first publication of Paxos by about a year. If you're looking for intrigue you may be disappointed: both Lamport and Liskov claim the inventions were independent."

_vvhw5y ago

VR was of course developed independently by Barbara Liskov and Brian Oki and then revisited by Barbara Liskov with James Cowling.

However, it is common practice and perfectly acceptable to refer to VR as a variant of "Multi-Paxos" because that's actually EXACTLY what it is in theory, the protocol maps one-to-one, cf. Dan Ports (he explains it nicely here, see slide 46 if you want his bumper sticker version): https://courses.cs.washington.edu/courses/csep552/16wi/slide...

The more you understand VR and Multi-Paxos the more you will see that this is true.

In fact, and you may be surprised/disappointed at this, but Raft is also a variant of Multi-Paxos very similar to VR (cf. Heidi Howard https://groups.google.com/g/raft-dev/c/cBNLTZT2q8o), except with tighter restrictions on leader election that make it less efficient than VR, which is why we chose VR over Raft coincidentally.

By the way, that's a fantastic post by Marc Brooker and was part (along with references by Martin Thompson and Heidi Howard) of what made us pay more attention to VR in the first place.

1 more reply

MadSudaca5y ago

If we can't make machines reach a consensus, what hope is there for human beings?

bregma5y ago

Nine out of ten scientists ardently believe reaching consensus is harder than it looks. The remainder are unpersuaded and firmly believe that reaching consensus is easy.

hinkley5y ago

One of the things about Raft that bugs me is that it appears that the nominee counts the votes. And the leader is the only machine that approves transactions. If our favorite protocol is susceptible to first coups and then despots, then we have a long long way to go to model humans.

carapace5y ago

From the title I thought this might be about "Consensus decision-making" https://en.wikipedia.org/wiki/Consensus_decision-making

(I suspect it's the most efficient form of government, FWIW.)

miguelrochefort5y ago

Consensus Is Not Necessary (2021)

All problems stem from the misconception that consensus is necessary. When you realize that it's not, everything becomes much easier:

- Democracy vs Contractualism

- Bitcoin vs Holochain

- Censorship vs Filters

The same way prediction markets don't expect consensus on the interpreted state of the future, description markets shouldn't expect consensus on the interpreted state of the past.

Understand that objective reality is interpreted through a subjective lens.

Allow disagreements. Allow contradicting information to coexist. Allow people to be wrong.

We don't need to share the same perspective of the world. Embrace the filter bubble.

FridgeSeal5y ago

I don’t know whether you’re just making a statement or you weren’t aware, but this is about consensus in the technical sense; which is a case in which it is often necessary.

Social consensus is a whole different kettle of fish.

Predictive markets don’t need consensus, because they can be wrong. We have certain correctness expectations for other systems, and for those, reconciliation/fixing-it-after-the-fact can be more pain and effort than it’s worth.

bob10295y ago

Consensus is not typically necessary.

Consider a global scale e-commerce application. You may have a handful of operations that absolutely need to pull consensus across the entire system (i.e. adding/removing nodes, creating new accounts, authenticating and creating sessions).

Everything else could be left to a per-node basis without compromising the integrity of sensitive transactions.

If you look at this from a physics perspective, I think it gets really simple. You want to minimize your user request latency. More consensus = more participants = more latency.

If the grain of your consensus mechanism is your datastore and everything in your business requires pulling a majority of votes over the cluster, you cannot leverage this philosophy. I think the biggest mistake out there right now is trying to abstract this stuff under the datastore and not allowing the business logic to access it directly. Whether or not something needs to pull consensus should be a business question, not a technical question.

FridgeSeal5y ago

> Whether or not something needs to pull consensus should be a business question, not a technical question

By its nature it’s a technical question with some crossover into business realm. The problem of consensus is technical, hard, full of strange hard-to-reason-about occurrences and second/third order effects. The decision to be made is whether some business thing needs those guarantees is still a technical decision.

> he biggest mistake out there right now is trying to abstract this stuff under the datastore and not allowing the business logic to access it directly

I think the issue with this is that there’s not a lot to be gained by allowing this, it’s incredibly complex as very easy to get subtly wrong-and I’ve seen very competent and very average devs get stuck on things that are comparatively far easier. Consensus is really hard. A similarity might be when NoSQL databases were all the rage and everyone was throwing away relational databases and ACID, only to recreate them at app level from first principles, but worse. Putting it in a db doesn’t solve the problem for all cases of course, but I can understand why it’d solve a lot of them.

j / k navigate · click thread line to collapse

31 comments

hinkley5y ago

Cache invalidation, one of the so-called hardest things in software, is simply a consensus problem on the lifetime of data. Naming things is consensus on definitions.

Consensus isn’t just hard, it’s the hardest thing. Possibly even the only hard thing, if you include the social aspects of software development too (which involve human consensus).

hinkley5y ago

I keep waiting for someone to bring a Turing, Gödel, or Shannon into this and point out it’s not computable.

mjb5y ago

_vvhw5y ago

"we have a more efficient consensus than Raft,"

If you're going to reach for a consensus algorithm, there are a lot of good reasons to do a survey of the literature first. There's a whole spectrum to choose from.

mjb5y ago

Joran, good to see your name come up!

> VR's view change protocol for leader election is also entirely in-memory so it's far more fault tolerant compared to Raft if you have a storage fault model and not only a network fault model.

Only in the fail-stop model, right? Or does this property extend to other models (like omissions)?

1 more reply

danbruc5y ago

If we find a way to solve that problem [...]

We will not. And you are hinting at the reason yourself.

[...] CRDTs, which are trying to make state changes commutative.

FAANG_dream5y ago

Ohh for sure!

I tried implementing Raft, which is supposed to be the most understandable out of all available consensus algorithms, but wasn't able to make it work 100%.

I came close but in the end gave up on resolving some concurrency issues :/

kilotaras5y ago

I'm not sure if you are aware of it, but check out MIT 6.824 on opencourseware (pick 2020, not 2021).

ignoramous5y ago

You are not alone: https://news.ycombinator.com/item?id=23123701 and https://youtu.be/QVvFVwyElLY?t=2502

Like TFA points out, distributed consensus is punishingly hard. AWS relied on TLA+ to prove consensus in DynamoDB and other systems https://lamport.azurewebsites.net/tla/formal-methods-amazon....

Interestingly, Kinesis [0] and SQS (?) avoid consensus for those same reasons.

[0] https://news.ycombinator.com/item?id=25239100

mjb5y ago

ZephyrBlu5y ago

How did you know there were still concurrency issues? I would expect issues like that to be relatively subtle.

Did you build out a test suite?

jerf5y ago

eternalban5y ago

The author mentions in passing "Virtual Synchrony" . It turns out it is used in some highly critical systems via ISIS/VSync framework from Cornel:

https://en.wikipedia.org/wiki/Virtual_synchrony#Virtual_sync...

https://en.wikipedia.org/wiki/Vsync_(computing)

_vvhw5y ago

Some things I'm finding so far:

eternalban5y ago

> I am implementing a Multi-Paxos variant called Viewstamped Replication

VR is -not- a variation of Paxos much less the later multi-paxos.

Viewstamped Replication was developed independently from Paxos and is distinct from Paxos. (And it came out a year before Paxos):

From the author of this OP:

https://brooker.co.za/blog/2014/05/19/vr.html

_vvhw5y ago

VR was of course developed independently by Barbara Liskov and Brian Oki and then revisited by Barbara Liskov with James Cowling.

The more you understand VR and Multi-Paxos the more you will see that this is true.

By the way, that's a fantastic post by Marc Brooker and was part (along with references by Martin Thompson and Heidi Howard) of what made us pay more attention to VR in the first place.

1 more reply

MadSudaca5y ago

If we can't make machines reach a consensus, what hope is there for human beings?

bregma5y ago

Nine out of ten scientists ardently believe reaching consensus is harder than it looks. The remainder are unpersuaded and firmly believe that reaching consensus is easy.

hinkley5y ago

carapace5y ago

From the title I thought this might be about "Consensus decision-making" https://en.wikipedia.org/wiki/Consensus_decision-making

(I suspect it's the most efficient form of government, FWIW.)

miguelrochefort5y ago

Consensus Is Not Necessary (2021)

All problems stem from the misconception that consensus is necessary. When you realize that it's not, everything becomes much easier:

- Democracy vs Contractualism

- Bitcoin vs Holochain

- Censorship vs Filters

The same way prediction markets don't expect consensus on the interpreted state of the future, description markets shouldn't expect consensus on the interpreted state of the past.

Understand that objective reality is interpreted through a subjective lens.

Allow disagreements. Allow contradicting information to coexist. Allow people to be wrong.

We don't need to share the same perspective of the world. Embrace the filter bubble.

FridgeSeal5y ago

I don’t know whether you’re just making a statement or you weren’t aware, but this is about consensus in the technical sense; which is a case in which it is often necessary.

Social consensus is a whole different kettle of fish.

bob10295y ago

Consensus is not typically necessary.

Everything else could be left to a per-node basis without compromising the integrity of sensitive transactions.

If you look at this from a physics perspective, I think it gets really simple. You want to minimize your user request latency. More consensus = more participants = more latency.

FridgeSeal5y ago

> Whether or not something needs to pull consensus should be a business question, not a technical question

> he biggest mistake out there right now is trying to abstract this stuff under the datastore and not allowing the business logic to access it directly

j / k navigate · click thread line to collapse