Raft: Understandable Distributed Consensus (2014) (opens in new tab)

(thesecretlivesofdata.com)

235 pointsHrun01y ago87 comments

87 comments

I am in the minority who thinks Raft is overrated.

I tried teaching Raft one year instead of Paxos but ended up switching back. While it was much easier to understand how to implement Raft, I think my students gained deeper insight when focusing on single-decision Paxos. There is a lightbulb moment when they first understand that consensus is a property of the system that happens first (and they can point at the moment it happens) and then the nodes discover that it has been achieved later. Exploring various failure modes and coming to understand how Paxos is robust against them seems to work better in this setting as well.

I think this paper by Heidi Howard and Richard Mortier is a great way to move on to Multipaxos:

https://arxiv.org/abs/2004.05074

They present Multipaxos in a similar style to how Raft is laid out and show that Multipaxos as it is commonly implemented and Raft are almost the same protocol.

Raft was a great contribution to the engineering community to make implementing consensus more approachable, but in the end I don't think the protocol itself is actually more understandable. It was presented better for implementers, but the implementation focus obscures some of the deep insights that plain Paxos exposes.

sriram_malhar1y ago

I have had the opposite trajectory. Used to teach Paxos, but was so relieved to switch to Raft when the paper came out.

I discuss distributed data structures in the context of maps and sequences.

For maps, I discuss key-value stores (NUMA, Redis). I have them implement cache coherence (MESI protocol, TARDIS 2.0), then linearizable, fault-tolerant, wait-free shared memory registers (the Attiya/Bar-Noy/Dolev algorithm[1]).

For sequences, I cover shared logs and state machine replication, including database log shipping, Kafka, queues and Raft.

I like Raft because it cuts down the design space by making certain very intuitive and pragmatic choices, like using timeouts (which are almost beneath Lamport to discuss :), or idioms like "follow the leader", "if the leader is unreachable, stand for election", "elect the latest & most informed leader", (how I wish that was true in real life!), "always append" etc. There are simple mechanisms to preserve invariants.

The problem with Paxos is that there is such a large range of papers that there is no one paper that makes the leap in easy digestible chunks from Basic to MultiPaxos. When I got students to implement MultiPaxos, I never could get sufficient confidence that it was done right (esp. the "disorderly" filling in of log slots).

Paxos is like Monads; when you get it, you feel compelled to write a "Paxos explained" paper :)

[1]https://groups.csail.mit.edu/tds/papers/Attiya/PODC90.pdf

withinboredom1y ago

I feel like you are doing your students a disservice. (multi-)Paxos, while complex to wrap your head around, enables far more modes of consensus. The possibilities and papers out there are amazing.

Raft essentially only allows a single mode. Moreover, you are starting to see people putting things on top of Raft instead of something like Paxos, in the enterprise, because they don't know any better nor have the foundation to understand what they are doing is "wrong."

> When I got students to implement MultiPaxos, I never could get sufficient confidence that it was done right

Testing this is fairly straightforward, they should be able to join an already existing cluster. If they got it wrong, it shouldn't take down the cluster, and they should be able to step through their own code. There aren't any timeouts, so they can take their time, going through each step of the process until a value is committed.

At that point, you simply explain each step as an individual algorithm, not the sum of its parts. You can even build each part individually because an existing cluster should recover from a misbehaving peer.

From there, it is a rather simple visualization process to see what is going on.

The hard part of paxos is building it from scratch.

1 more reply

withinboredom1y ago

Know that I join you in Raft being overrated. I’m working on a multipaxos implementation right now. There are some really neat capabilities/properties that paxos has that Raft can never achieve (see wpaxos, for example, that lets keys migrate to nodes near the client).

hintymad1y ago

Is there any paper/handouts/video that explains Paxos in depth, especially its implementations and intuitions? Paxos Made Simple gave intuitive explanations, but I feel it still misses a lot of intricate details if I were to build Praxos for production use.

mananaysiempre1y ago

The papers and talks[1] by Heidi Howard and Richard Mortier, in particular “Paxos vs Raft: Have we reached consensus on distributed consensus?”[2,3] and “Flexible Paxos”[4,5], are what finally made things click for me. A real implementation also needs other stuff[6], though, such as dynamic membership and state machine replication, which I still don’t know how to do.

[1] https://fpaxos.github.io/

[2] https://dl.acm.org/doi/abs/10.1145/3380787.3393681 or https://arxiv.org/abs/2004.05074

[3] https://www.youtube.com/watch?v=0K6kt39wyH0

[4] https://doi.org/10.4230/LIPIcs.OPODIS.2016.25 or https://arxiv.org/abs/1608.06696

[5] https://www.youtube.com/watch?v=r6NG_1HM0lA

[6] https://github.com/heidihoward/distributed-consensus-reading...

nvarsj1y ago

The best approach is to implement Paxos.

I suggest https://github.com/emichael/dslabs.

It will take 100-200 hours to fully implement with all tests passing.

1 more reply

alexnewman1y ago

You are telling me your students can safely implement single decree paxos.... I worked on a few paxos production implementations before the RAFT paper, single and multi decree. The idea that paxos, as it is to be implemented, is easy for students to understand... Well let me re-read the paper, but i assure you, raft was a big deal

alexchamberlain1y ago

I've read both the Paxos and Raft papers a few times, and hacked on some implementations, but never quite got one over the line to working...

Raft strikes me as a particular set of decisions made within a Paxos framework, such as having 1 entity for Proposers, Acceptor and Followers. It's frustrating that there isn't a clearly written defacto paper on Paxos - the story style confused the monkeys out of me.

jimbokun1y ago

> but never quite got one over the line to working...

I've never implemented something like this. But my first thought is "how do you implement the testing system?"

I feel like once you had a robust testing system that can verify things work correctly in all the different network partition and other scenarios, and allowing rapid iteration of setting up those scenarios, the implementation would be comparatively easy.

2 more replies

weinzierl1y ago

This is an interesting insight into the educational side, but now I am curious about the implementation side. Raft is easier to implement but that's just one factor. Looking at real world usages there seems to be a draw. I could easily count as many Paxos implementations as Raft. Is this just historical or are there good reasons for a new project to still implement Oaxos?

bfdes1y ago

Paxos and Multi-Paxos have been around much longer than Raft. The paper that introduced Raft was published in 2014.

senderista1y ago

Agree, Raft is less modular and therefore harder to understand than MultiPaxos:

https://maheshba.bitbucket.io/blog/2021/12/14/Modularity.htm...

benbjohnson1y ago

Author here. I'm happy to answer any questions although this project was from 10+ years ago so I could be a little rusty.

Over the years I've been trying to find better ways to do this kind of visualization but for other CS topics. Moving to video is the most realistic option but using something like After Effects takes A LOT of time and energy for long-form visualizations. It also doesn't produce a readable output file format that could be shared, diff'd, & tweaked.

I spent some time on a project recently to build out an SVG-based video generation tool that can use a sidecar file for defining animations. It's still a work in progress but hopefully I can get it to a place where making this style of visualizations isn't so time intensive.

huntaub1y ago

I just want you to know how much this visualization was appreciated. In my time working at AWS, I recommended this website to every one of our new hires to learn how distributed consensus works. Know that this has taught probably 50+ people. Thank you for what you’ve built.

benbjohnson1y ago

Thanks so much for letting me know! It's always hard to tell when I put something out there if it just gets lost in the ether. I'm glad to hear it helped so many folks.

kfrzcode1y ago

What are your thoughts on Dr. Leemon Baird's Hedera Hashgraph?

https://www.swirlds.com/downloads/SWIRLDS-TR-2016-01.pdf

thesz1y ago

"Suppose Alice has hashgraph A and Bob hash hashgraph B. These hashgraphs may be slightly different at any given moment, but they will always be consistent. Consistent means that if A and B both contain event x, then they will both contain exactly the same set of ancestors for x, and will both contain exactly the same set of edges between those ancestors."

Consider UTXO-based events. There can be an event E1 that consumes UTXO1 and UTXO2 and event E2 that consumes UTXO2 and UTXO3. Hashgaphs that contain one of these events are consistent but their union is not. This can be used to perform some byzantine things, I can think of at least two of them: doublespend and degradation of service.

This paper is a clear example of how to make a thing that has no obvious problems.

benbjohnson1y ago

I haven't read that paper but it seems like it's fixing a different problem of Byzantine fault tolerance. Most consensus systems that are internal for an organization don't have the Byzantine issue so it simplifies the problem.

mgenglder1y ago

This is wonderful. Can I ask how you created it? Stack used and sour e code? I'd love to create something like this to help visualize things I'm working with currently.

benbjohnson1y ago

It's all done with d3 and JavaScript. The visualizations aren't deterministic so I ended up writing a shitty Raft implementation in JS. Overall it was a terrible approach because it was so time consuming but I made it work. You can find all the source code in this repo: https://github.com/benbjohnson/thesecretlivesofdata

MarkMarine1y ago

This is one of my favorite pieces of software engineering because it took something difficult and tried to design something easy to understand as a main criteria for success. The PHD Thesis has a lot more info about this if anyone is curious, it is approachable and easy to read:

https://web.stanford.edu/~ouster/cgi-bin/papers/OngaroPhD.pd...

I think this was core to Raft’s success, and I strive to create systems like this with understandability as a first goal.

throwawaymaths1y ago

Weirdly it's also kinda worse is better: raft is non-deterministic and has an unboundedly long election cycle time. IIRC:

- it assumes no hysteresis in network latencies and if there is a hysteresis it's possible that elections can be deterministically infinite.

- this fact and the use of raft in production has caused real, large scale network outages.

Paxos is of course a beast and hard to understand. There is an alternative, VSR (which was developed ~time of paxos) which is easy to understand and does not have the issues caused by election nondeterminism in raft.

Of course everyone uses raft so raft dominates.

eatonphil1y ago

> - this fact and the use of raft in production has caused real, large scale network outages.

While this has surely happened, I am not so confident about what the reasons were for this. If you've got links on details I'd love to read.

> which is easy to understand

I've implemented core bits of Raft twice now and have looked at VSR a couple of times and VSR wasn't easier for me to understand. I'm sure I could implement VSR and would like to some day, but just comparing the papers alone I personally felt like Raft was better presented (i.e. easier to understand).

Also keep in mind that nobody ships consensus implementations exactly in line with the original paper. There are dozens or hundreds of papers on variations and extensions of Raft/Paxos and every actual implementation is going to implement some selection of these extensions/variations. You have to look at each implementation carefully to know how it diverges from the original paper.

1 more reply

convolvatron1y ago

don't forget about calm. its a shame that that isn't the default we reach for, and only struggle when we really need stronger latency.

https://arxiv.org/abs/1901.01930

1 more reply

MarkMarine1y ago

> this fact and the use of raft in production has caused real, large scale network outages.

Paxos as well, I remember full cloud GCP outage that had something to do with Paxos, and I can’t find the data on it but I thought there was a nasty bug in zookeeper paxos implementation.

That isn’t to say any of these are perfect or bug free, it’s made by humans and we’re going to make mistakes, but my experience implementing both was I had a working raft implementation and paxos baked my brain until I gave up.

I think everyone uses raft _because_ it was possible to implement for a working dev, so there are a number of implementations, and it’s easier to understand the phases the application is in.

I’ll check out VSR I appreciate the rec.

1 more reply

kfrzcode1y ago

I've made a longer comment in another thread; but have you investigated the hashgraph algorithm? Gossip-about-gossip and virtual voting combine to result in leaderless consensus, fair ordering and aBFT. It's very performant with over 10k TPS on-chain. I'm learning about DLT from the perspective of hashgraph which is why I don't understand why it doesn't get love - it seems to have all of the good and none of the bad.

1 more reply

prydt1y ago

I've run a reading group for distributed systems for the last 2 years now and I do think that Raft is a better introduction to Consensus than any Paxos paper I have seen (I mean the Paxos Made Simple paper literally has bugs in it). But when I learned consensus in school, we used Paxos and Multi-Paxos and I do believe that there was a lot to be gained by learning both approaches.

Heidi Howard has several amazing papers about how the differences between Raft and Multi-Paxos are very surface level and that Raft's key contribution is its presentation as well as being a more "complete" presentation since there are so many fragmented different presentations of Multi-Paxos.

As a bonus, one of my favorite papers I have read recently is Compartmentalized Paxos: https://vldb.org/pvldb/vol14/p2203-whittaker.pdf which is just a brilliant piece on how to scale Multi-Paxos

senderista1y ago

There are several Multi-Paxos papers (some of them dating before Raft) that are intended as guidance for implementers:

https://paper-notes.zhjwpku.com/assets/pdfs/paxos_for_system...

https://www.cs.cornell.edu/home/rvr/Paxos/paxos.pdf

https://www.scs.stanford.edu/~dm/home/papers/paxos.pdf

prydt1y ago

Ah thank you. That is a good list although I personally dislike the "Paxos Made Moderately Complex" paper... I think it adds too many different roles for very little benefit. When implementing multi-Paxos for class, I used that paper and felt it was more trouble than it needed to be.

I'll check out the other two papers though! Also just looking around and I found this paper https://arxiv.org/pdf/1103.2408 [PDF] which looks useful as well.

nano_o1y ago

The bug in the Paxos Made Simple paper is that Lamport forgot to mention that, upon accepting a proposal, an acceptor also implicitly promises not to accept any proposal in lower ballots. It's discussed at length here: https://stackoverflow.com/questions/29880949/contradiction-i...

shepherdjerred1y ago

What’s your reading group?

I took a DS class and (poorly) implemented Paxos a few years ago. I’m curious about how others continue learning about DS.

mbivert1y ago

In case this is of interest, MIT's 6.5840[0], distributed systems, has a series of labs, implementing Raft in Go. Haven't made it through the whole thing yet, but it's quite entertaining so far.

The teachers provide you with some code templates, a bunch of tests, and a progressive way to implement it all.

[0]: https://pdos.csail.mit.edu/6.824/index.html

lifeinthevoid1y ago

really liked the course, it was also instrumental in landing me my previous job :-)

skilning1y ago

I was asked to click "continue" after each of the first two sentences, and the fade-in of the text took longer than reading the text.

This may be a great article, but I'll never know because it's frustrating to try and read.

dang1y ago

Raft Consensus Animated (2014) - https://news.ycombinator.com/item?id=32484584 - Aug 2022 (67 comments)

Raft Visualization - https://news.ycombinator.com/item?id=25326645 - Dec 2020 (35 comments)

Raft: Understandable Distributed Consensus - https://news.ycombinator.com/item?id=8271957 - Sept 2014 (79 comments)

eatonphil1y ago

Ben's visualization here is great.

The other biggest help to me aside from the paper and the thesis was Ongaro's TLA+ spec: https://github.com/ongardie/raft.tla/blob/master/raft.tla. It's the only super concise "implementation" I found that is free of production-grade tricks, optimizations, and abstractions.

And for building an intuition, TigerBeetle's sim.tigerbeetle.com is great. What happens to consensus when there's high latency to disk or network? Or as processes crash more frequently? It demonstrates.

pa7ch1y ago

Interesting that TigerBeetle uses Viewstamp Replication over Paxos/Raft. TB says viewstamp replication lends itself to a more performant implementation and doesn't rely on disk storage as much. I'm surprised I'm not seeing this brought up more in Paxos/Raft discussions.

butterisgood1y ago

VR is often overlooked and really pretty easily understood

ko_pivot1y ago

The writing and the visualizations are great. The ‘continue’ button is way too frequent.

quelltext1y ago

In the log replication example, after healing the partition the uncommitted log changes in the minority group are rolled back and the leader's log is used.

However it's not clear how that log is transmitted. Until this point only heartbeats via append entry were discussed, so it's not clear if the followers pull that information from the leader somehow via a different mechanism, or whether it's the leader's responsibility to detect followers that are left behind and replay everything. That would seem rather error prone and a lot of coordination effort. So how's it actually done?

jhanschoo1y ago

Just earlier this month I was going through https://github.com/jepsen-io/maelstrom and following the demo implementing (a cut-corners version of) Raft, and I found it quite elucidating. The sample (ruby) code contains some bugs, and I had to use some understanding to fix them. (The bugs were of the kind where a dummy implementation from an earlier step isn't correctly changed)

shiredude951y ago

"Paxos Made Moderately Complex" by Robert van Renesse and Deniz Altinbuken: http://www.cs.cornell.edu/courses/cs7412/2011sp/paxos.pdf is a great starting point for implementing multi-paxos. The authors also provide a working python implementation.

eatonphil1y ago

Here's a visualization of Paxos and MultiPaxos that seems to be based on Ben's work: https://visual.ofcoder.com/.

kfrzcode1y ago

DLT technology discussions are entirely incomplete without consideration of Hedera Hashgraph [0], an aBFT, leaderless, fair and fast DLT using a gossip-about-gossip consensus mechanism. It's absolutely a more robust and scalable technology than Paxos or any other DLT for that matter. I'd love to know what the HN crowd thinks about Hedera as the trust layer of the internet but.... nobody around here seems to have any. It's like ignoring Linux while comparing Mac and Windows based computing.

[0]: https://www.swirlds.com/downloads/SWIRLDS-TR-2016-01.pdf

lapinot1y ago

I got interested into the hashgraph algorithm quite early and wrote a toy implementation in python (in fact discovering an error in the paper in the process). Unless something changed it's entirely useless for open-membership internet-scale consensus. As I remember, the processing time of a message at a node is linear in the number of nodes and same for the local storage at a node, meaning it's not very scalable. Moreover, the nodes must somehow a priori agree on the list of participants of the consensus process, again something which is not realistic for internet-wide consensus. The protocol is quite neat and not too hard to implement but it's similar in scope to paxos/raft: consensus inside an organization, where some things are a priori agreed upon.

ryanthemadone1y ago

Paxos isn't a DLT, it's a consensus algorithm — granted, DLTs tend to require a consensus algorithm, but they're not the same things.

As for Hedera Hashgraph being the trust layer of the internet, we tend to build the internet through the IETF and standards setting. Unfortunately HH is an endeavour from a private company so isn't especially likely to be taken on in that context.

I'd also wonder what you mean by the trust layer of the internet, what are the use cases that you'd like to see solved with such a trust layer?

kfrzcode1y ago

Great point, I wasn't being precise. My point stands, however! I must also correct your understanding re: "private company."

The DLT in question - Hedera - is built on the unique consensus algorithm, the "hashgraph". The hashgraph algorithm combines a gossip-about-gossip protocol with virtual voting.

In fact, services development is now in the hands of the largest open source foundation in the world. These implementations are entirely open source [0], and very recently the codebase has been donated in whole to the Linux Foundation.

Furthermore, Hedera is a Pioneer member of the newly founded Linux Foundation Decentralized Trust organization [1] - along with Chainlink, Deloitte, Hitachi, many other major organizations, some of whom are also on the Hedera governing council [2]. This foundation will be a big player in the future of decentralized web, and Hedera is the only L1 that I know of which is both primed for this future and actually scalable.

I understand the IETF and standards approach; and am not aware of a current draft or intention for such a draft. The idea of Hedera being the 'trust layer' of the internet is more about use cases like decentralized recovery, process validation, carbon offsets, extremely granular supply chain auditing, and any other application you might imagine that would benefit from having extremely fast (10k+ TPS), 100% guaranteed aBFT consensus on-chain. I'd love to hear what you might think up or where this could be particularly useful. Strong governance with 39 council members - including Google, IBM, Boeing, Tata, AP+, Hitachi and more... with decentralized network operation and stable fees (essential for enterprise application).

> Additional use cases include, but are not limited to, financial markets, matching engines (like those used in Uber or AirBnb), or supply chain negotiations (e.g., several competing factories bidding on parts from several competing parts suppliers).

So, admittedly the use cases maybe aren't evident or interesting to you and I right now at the TCP/IP layer, but I can certainly say there are a plethora of trust-based problems that could be solved with consensus only needing a few thousand ms. Think digital identity, healthcare, financial markets, IoT, supply chain, real-world asset tokenization... For any real-world, scaled application, a DLT must have very high throughput at high performance. It's the absolute highest performance and security possible in a leaderless consensus-based DLT as far as I know.

Literally carbon neutral or negative because of buybacks but even without Hedera buying carbon credits, it's the single most "green" i.e. power-efficient DLT on the market. Does HN still care about Bitcoin using too much power? They would like Hedera, to that extent. Predictable, very low fixed fees. Long list of the biggest tech players leading the open development process. What's not to love?

I urge you to help me invalidate these claims as it's pretty important I understand the tech here... But I'm very bullish on the token price as the technology is proven, robust, and overall extremely undervalued by retail - in my opinion. NFA.

> The Hashgraph consensus algorithm is an algorithm for asynchronous Byzantine fault tolerance intended for distributed shared ledgers. Its main distinguishing characteristic is it achieves consensus without exchanging any extra messages; each participant’s votes can be determined from public information, so votes need not be transmitted.

For more rigorous explanation, see [3] and the associated Coq proof of the algorithm [4].

[0]: https://github.com/hashgraph/hedera-services

[1]: https://www.lfdecentralizedtrust.org/

[2]: https://hedera.com/ecosystem/governing-council

[3]: https://hedera.com/papers

[4]: https://www.cs.cmu.edu/~crary/papers/2021/hashgraph.pdf

wg01y ago

Off topic - folks in the know, besides Paxos/Raft what are some of the other most complex algorithm in computer sciences that are widely used or are a bedrock?

withinboredom1y ago

I can think of a deceptively simple one: recursive descent parsers. They are pretty straightforward to implement but hard to read/reason about if you don't already know the grammar.

mbivert1y ago

I'd be curious to know as well; I'd bet on things pertaining to either compilers (e.g. optimization stuff), concurrency/OSes, graphics engines & image processing.

rapsey1y ago

While understandable, implementing it is however far from easy.

jimbokun1y ago

How do you write a testing framework/system for evaluating distributed algorithms like this?

Doing that in a way that's easy to automate, set up, evaluate, and tear down various scenarios seems like the hardest part to me.

1 more reply

ukd11y ago

Having implemented it twice (for fun/learning, though) I think you're right - yet it's also the easiest, imho. The paper, and resources around implementing it (posts, other implementations) are great. Also, the authors still reply on github when issues are opened, which is great.

MarkMarine1y ago

Right, but Paxos is double hard in comparison. I’ve read both papers multiple times, tried to implement and failed, and I still don’t think I understand Paxos.

mrkeen1y ago

I'm on the other side.

I think Leslie Lamport asserted that Paxos is minimal, and that "all other consensus algorithms are just Paxos with more steps". I'm inclined to believe him.

I've implemented Paxos but I can't get through "Raft for dummies" style blog posts.

Regarding Raft [1]:

  > The consensus problem is divided into three sub-problems: Leader election, Replication and Safety.

What is leader election? It's a distributed system coming to consensus on a fact (i.e. who the leader is.) Then once you have the leader, you do additional steps. The entirety of Paxos is a distributed system coming to consensus on a fact.

When I read these posts, i see things like "timeout", "heartbeat", and I think: timeout according to whom? I read "once the leader has been elected", um, hangon, according to whom? Has node 1 finally agreed on the leader, just while node 3 has given up and started another election? I don't doubt that Raft is correct, but the writing about it seems simple by glossing over details.

Paxos, on the other hand, seems timeless. (And the writing about it doesn't trigger my "distributed system fallacies" reaction)

[1] https://www.brianstorti.com/raft/

3 more replies

Vervious1y ago

Raft has a problem where, in the protocol description, sometimes I have no idea why some line is there, but if you take it out the protocol comes to a grinding halt... it's really a mumble jumble. It has good intuition, but the details are really messy, it's edge cases all the way down.

Whereas I think each line of pseudocode in Paxos is much more motivated.

In other words, if a philosopher had to design a crash-fault protocol from scratch, without having seen any before, I think 80% of the time it would look exactly like Paxos.

jeffbee1y ago

Students of raft say they understand it but when they have to implement it they make as many mistakes as students of paxos. The simplicity is misleading.

candiddevmike1y ago

IMO, Paxos has a lot less edge cases than Raft, mostly because of the complexity of the implementation covers them/forces you to think about how to handle them.

1 more reply

matthewfcarlson1y ago

This really teaches Raft well. Is there a good example of this but for Paxos?

cedws1y ago

Can't proof-of-work be used as a leader election algorithm? If the proof is hard enough to generate then one node should be able to generate one and broadcast it before the other nodes can, then that node becomes the leader.

withinboredom1y ago

There’s a paper about that, using paxos as the base. Can’t find it right now, it is called “chained paxos” or “block paxos” or something like that.

ryanthemadone1y ago

Proof of work is a leader election algorithm!

j / k navigate · click thread line to collapse

87 comments

_russross1y ago

I am in the minority who thinks Raft is overrated.

I think this paper by Heidi Howard and Richard Mortier is a great way to move on to Multipaxos:

https://arxiv.org/abs/2004.05074

They present Multipaxos in a similar style to how Raft is laid out and show that Multipaxos as it is commonly implemented and Raft are almost the same protocol.

sriram_malhar1y ago

I have had the opposite trajectory. Used to teach Paxos, but was so relieved to switch to Raft when the paper came out.

I discuss distributed data structures in the context of maps and sequences.

For sequences, I cover shared logs and state machine replication, including database log shipping, Kafka, queues and Raft.

Paxos is like Monads; when you get it, you feel compelled to write a "Paxos explained" paper :)

[1]https://groups.csail.mit.edu/tds/papers/Attiya/PODC90.pdf

withinboredom1y ago

I feel like you are doing your students a disservice. (multi-)Paxos, while complex to wrap your head around, enables far more modes of consensus. The possibilities and papers out there are amazing.

> When I got students to implement MultiPaxos, I never could get sufficient confidence that it was done right

From there, it is a rather simple visualization process to see what is going on.

The hard part of paxos is building it from scratch.

1 more reply

withinboredom1y ago

hintymad1y ago

mananaysiempre1y ago

[1] https://fpaxos.github.io/

[2] https://dl.acm.org/doi/abs/10.1145/3380787.3393681 or https://arxiv.org/abs/2004.05074

[3] https://www.youtube.com/watch?v=0K6kt39wyH0

[4] https://doi.org/10.4230/LIPIcs.OPODIS.2016.25 or https://arxiv.org/abs/1608.06696

[5] https://www.youtube.com/watch?v=r6NG_1HM0lA

[6] https://github.com/heidihoward/distributed-consensus-reading...

nvarsj1y ago

The best approach is to implement Paxos.

I suggest https://github.com/emichael/dslabs.

It will take 100-200 hours to fully implement with all tests passing.

1 more reply

alexnewman1y ago

alexchamberlain1y ago

I've read both the Paxos and Raft papers a few times, and hacked on some implementations, but never quite got one over the line to working...

jimbokun1y ago

> but never quite got one over the line to working...

I've never implemented something like this. But my first thought is "how do you implement the testing system?"

2 more replies

weinzierl1y ago

bfdes1y ago

Paxos and Multi-Paxos have been around much longer than Raft. The paper that introduced Raft was published in 2014.

senderista1y ago

Agree, Raft is less modular and therefore harder to understand than MultiPaxos:

https://maheshba.bitbucket.io/blog/2021/12/14/Modularity.htm...

benbjohnson1y ago

Author here. I'm happy to answer any questions although this project was from 10+ years ago so I could be a little rusty.

huntaub1y ago

benbjohnson1y ago

Thanks so much for letting me know! It's always hard to tell when I put something out there if it just gets lost in the ether. I'm glad to hear it helped so many folks.

kfrzcode1y ago

What are your thoughts on Dr. Leemon Baird's Hedera Hashgraph?

https://www.swirlds.com/downloads/SWIRLDS-TR-2016-01.pdf

thesz1y ago

This paper is a clear example of how to make a thing that has no obvious problems.

benbjohnson1y ago

mgenglder1y ago

This is wonderful. Can I ask how you created it? Stack used and sour e code? I'd love to create something like this to help visualize things I'm working with currently.

benbjohnson1y ago

MarkMarine1y ago

https://web.stanford.edu/~ouster/cgi-bin/papers/OngaroPhD.pd...

I think this was core to Raft’s success, and I strive to create systems like this with understandability as a first goal.

throwawaymaths1y ago

Weirdly it's also kinda worse is better: raft is non-deterministic and has an unboundedly long election cycle time. IIRC:

- it assumes no hysteresis in network latencies and if there is a hysteresis it's possible that elections can be deterministically infinite.

- this fact and the use of raft in production has caused real, large scale network outages.

Of course everyone uses raft so raft dominates.

eatonphil1y ago

> - this fact and the use of raft in production has caused real, large scale network outages.

While this has surely happened, I am not so confident about what the reasons were for this. If you've got links on details I'd love to read.

> which is easy to understand

1 more reply

convolvatron1y ago

don't forget about calm. its a shame that that isn't the default we reach for, and only struggle when we really need stronger latency.

https://arxiv.org/abs/1901.01930

1 more reply

MarkMarine1y ago

> this fact and the use of raft in production has caused real, large scale network outages.

Paxos as well, I remember full cloud GCP outage that had something to do with Paxos, and I can’t find the data on it but I thought there was a nasty bug in zookeeper paxos implementation.

I think everyone uses raft _because_ it was possible to implement for a working dev, so there are a number of implementations, and it’s easier to understand the phases the application is in.

I’ll check out VSR I appreciate the rec.

1 more reply

kfrzcode1y ago

1 more reply

prydt1y ago

As a bonus, one of my favorite papers I have read recently is Compartmentalized Paxos: https://vldb.org/pvldb/vol14/p2203-whittaker.pdf which is just a brilliant piece on how to scale Multi-Paxos

senderista1y ago

There are several Multi-Paxos papers (some of them dating before Raft) that are intended as guidance for implementers:

https://paper-notes.zhjwpku.com/assets/pdfs/paxos_for_system...

https://www.cs.cornell.edu/home/rvr/Paxos/paxos.pdf

https://www.scs.stanford.edu/~dm/home/papers/paxos.pdf

prydt1y ago

I'll check out the other two papers though! Also just looking around and I found this paper https://arxiv.org/pdf/1103.2408 [PDF] which looks useful as well.

nano_o1y ago

shepherdjerred1y ago

What’s your reading group?

I took a DS class and (poorly) implemented Paxos a few years ago. I’m curious about how others continue learning about DS.

mbivert1y ago

In case this is of interest, MIT's 6.5840[0], distributed systems, has a series of labs, implementing Raft in Go. Haven't made it through the whole thing yet, but it's quite entertaining so far.

The teachers provide you with some code templates, a bunch of tests, and a progressive way to implement it all.

[0]: https://pdos.csail.mit.edu/6.824/index.html

lifeinthevoid1y ago

really liked the course, it was also instrumental in landing me my previous job :-)

skilning1y ago

I was asked to click "continue" after each of the first two sentences, and the fade-in of the text took longer than reading the text.

This may be a great article, but I'll never know because it's frustrating to try and read.

dang1y ago

Raft Consensus Animated (2014) - https://news.ycombinator.com/item?id=32484584 - Aug 2022 (67 comments)

Raft Visualization - https://news.ycombinator.com/item?id=25326645 - Dec 2020 (35 comments)

Raft: Understandable Distributed Consensus - https://news.ycombinator.com/item?id=8271957 - Sept 2014 (79 comments)

eatonphil1y ago

Ben's visualization here is great.

pa7ch1y ago

butterisgood1y ago

VR is often overlooked and really pretty easily understood

ko_pivot1y ago

The writing and the visualizations are great. The ‘continue’ button is way too frequent.

quelltext1y ago

In the log replication example, after healing the partition the uncommitted log changes in the minority group are rolled back and the leader's log is used.

jhanschoo1y ago

shiredude951y ago

eatonphil1y ago

Here's a visualization of Paxos and MultiPaxos that seems to be based on Ben's work: https://visual.ofcoder.com/.

kfrzcode1y ago

[0]: https://www.swirlds.com/downloads/SWIRLDS-TR-2016-01.pdf

lapinot1y ago

ryanthemadone1y ago

Paxos isn't a DLT, it's a consensus algorithm — granted, DLTs tend to require a consensus algorithm, but they're not the same things.

I'd also wonder what you mean by the trust layer of the internet, what are the use cases that you'd like to see solved with such a trust layer?

kfrzcode1y ago

Great point, I wasn't being precise. My point stands, however! I must also correct your understanding re: "private company."

The DLT in question - Hedera - is built on the unique consensus algorithm, the "hashgraph". The hashgraph algorithm combines a gossip-about-gossip protocol with virtual voting.

For more rigorous explanation, see [3] and the associated Coq proof of the algorithm [4].

[0]: https://github.com/hashgraph/hedera-services

[1]: https://www.lfdecentralizedtrust.org/

[2]: https://hedera.com/ecosystem/governing-council

[3]: https://hedera.com/papers

[4]: https://www.cs.cmu.edu/~crary/papers/2021/hashgraph.pdf

wg01y ago

Off topic - folks in the know, besides Paxos/Raft what are some of the other most complex algorithm in computer sciences that are widely used or are a bedrock?

withinboredom1y ago

I can think of a deceptively simple one: recursive descent parsers. They are pretty straightforward to implement but hard to read/reason about if you don't already know the grammar.

mbivert1y ago

I'd be curious to know as well; I'd bet on things pertaining to either compilers (e.g. optimization stuff), concurrency/OSes, graphics engines & image processing.

rapsey1y ago

While understandable, implementing it is however far from easy.

jimbokun1y ago

How do you write a testing framework/system for evaluating distributed algorithms like this?

Doing that in a way that's easy to automate, set up, evaluate, and tear down various scenarios seems like the hardest part to me.

1 more reply

ukd11y ago

MarkMarine1y ago

Right, but Paxos is double hard in comparison. I’ve read both papers multiple times, tried to implement and failed, and I still don’t think I understand Paxos.

mrkeen1y ago

I'm on the other side.

I think Leslie Lamport asserted that Paxos is minimal, and that "all other consensus algorithms are just Paxos with more steps". I'm inclined to believe him.

I've implemented Paxos but I can't get through "Raft for dummies" style blog posts.

Regarding Raft [1]:

  > The consensus problem is divided into three sub-problems: Leader election, Replication and Safety.

Paxos, on the other hand, seems timeless. (And the writing about it doesn't trigger my "distributed system fallacies" reaction)

[1] https://www.brianstorti.com/raft/

3 more replies

Vervious1y ago

Whereas I think each line of pseudocode in Paxos is much more motivated.

In other words, if a philosopher had to design a crash-fault protocol from scratch, without having seen any before, I think 80% of the time it would look exactly like Paxos.

jeffbee1y ago

Students of raft say they understand it but when they have to implement it they make as many mistakes as students of paxos. The simplicity is misleading.

candiddevmike1y ago

IMO, Paxos has a lot less edge cases than Raft, mostly because of the complexity of the implementation covers them/forces you to think about how to handle them.

1 more reply

matthewfcarlson1y ago

This really teaches Raft well. Is there a good example of this but for Paxos?

cedws1y ago

withinboredom1y ago

There’s a paper about that, using paxos as the base. Can’t find it right now, it is called “chained paxos” or “block paxos” or something like that.

ryanthemadone1y ago

Proof of work is a leader election algorithm!

j / k navigate · click thread line to collapse