Distributed Systems Shibboleths (opens in new tab)

(jolynch.github.io)

360 pointsjolynch4y ago66 comments

66 comments

Great post. This one always brings a smile to my face:

> Every component is crash-only

I was part of the team that developed a distributed, five-9's control system for an industry where downtime costs millions per minute and comes with a federal investigation if long enough. On top of that, the industry is made up of competitors that explicitly distrust each other, so all components had to be truly distributed, with no central coordination for anything.

Given the requirements we decided to explicitly adopt a crash-only approach. Between idempotent operations, horizontal scaling, and fast restart times, we could make failing components not impact SLAs (and we had testing to ensure it).

Once it gets out into the field (which because of how risk adverse this industry is, is measured in years), it turns out they really did not like software crashing. They interpreted crashing as bad quality, and no amount of "we do it on purpose to ensure correctness" was going to make them happy.

WJW4y ago

Rebrand it as "fault tolerant" and/or "adverse environment certified" and you should be good to go. That's how they do it in the military sector at least.

__alexs4y ago

It's not a crash it's a runtime state rollback.

caffeine4y ago

> They interpreted crashing as bad quality

The solution here is to rebrand it with some vague euphemism:

“Ah yes the component underwent a state calibration”

fh9734y ago

The term you're looking for is "software rejuvenation".

Jokes aside, there is even a body of research papers around this subject, if you need some backing.

quickthrower24y ago

I guess you were working for a stock exchange

mjb4y ago

I enjoyed reading that a lot.

> The main advantage of distributed transactions is that they make distributed systems look less distributed by choosing CP, but that inherently trades off availability!

This is true, but I suspect that its slightly missing the important thing about transactions. A transaction is an operation that takes the database from one Consistent (ACID "C", you can think about it as "legal under the business logic") state to another Consistent state. Linearizability (CAP "C") isn't enough to do that, because often changes in databases require "take from Bob and give to Alice", or "check Bob and Alice's balance and add the order", neither of which fit well into Linearizability's single-key definition of the world. Allowing developers to think about a stream of operations that moves the databases from one legal state to another is super powerful. The whole point is that it provides an abstraction that hides concurrency (ACID "I") and partial failure (ACID "A"). Saving developers from reasoning about those is a big win!

> I should also note that while Distributed Transactions might be a useful tool in building idempotency, simply wrapping a non idempotent operation (e.g. “add 1 to X”) in a transaction does not make it idempotent.

The OP is right that this isn't a panacea, especially where those transactions aren't idempotent. But transactions are a mechanism to implement idempotence ("insert order number 10 if it isn't there already"), and idempotence and ACID "C" can be really hard to achieve without transactions (or at least "I" and "A").

Transactions, CRDTs, and the CALM theorem are linked too. You can definitely have transactions in systems that aren't CAP "C" consistent, and still have them do legal things. The CALM theorem lays out one way to think about those, and CRDTs are a kind of object-oriented embodiment of that theory.

jolynchOP4y ago

Great points, transactions are certainly useful in helping developers think about state transitions. I think some of the ~snark might come from my personal struggles with trying to convey why wrapping non idempotent state transitions in "BEGIN TRANSACTION ... COMMIT" doesn't immediately make the system reliable. I completely agree transactions make understanding the state transitions easier and that is valuable.

I do think CRDTs or idempotency/fencing tokens are also a valuable way to reason about state transitions, and they can provide much lower latency in a global distributed system.

_vvhw4y ago

> Allowing developers to think about a stream of operations that moves the databases from one legal state to another is super powerful.

You're right—it's super powerful.

We used this "RSM intuition" to test TigerBeetle's strict serializability, by verifying state transitions the instant they happen, taking advantage of the test simulator's knowledge of inflight client requests, instead of trying to piece everything together and verify strict serializability after the fact.

Here's TB's state checker in 49 lines of Zig:

https://github.com/coilhq/tigerbeetle/blob/477d6df366e2c10fa...

User234y ago

I feel like the correct approach is accepting that determinacy is nonsensical in a world where time is relative and instead doubling down on nondeterministic (but predictable!) algorithms. This means leveraging concepts like commutativity and associativity to ensure predictability.

_vvhw4y ago

Surprisingly, some of the most powerful distributed systems algorithms or tools are actually deterministic.

They're powerful because they can "load the dice" and so make the distributed system more intuitive for humans to reason about, more resilient to real world faults, and do all this with more performance.

For example, Barbara Liskov and James Cowling's deterministic view change from VSR [1][2], which isn't plagued by the latency issues of RAFT's randomized dueling leader problem. VSR's deterministic view change can react to a failed primary much quicker than RAFT since heartbeat timeouts don't require the randomized "padding" that they do in RAFT, commence the leader election, and also ensure that the leader election succeeds without a split vote.

Determinism makes all this possible.

Deterministic testing [3][4] is also your best friend when it comes to testing distributed systems.

[1] An introductory talk on VSR and it's deterministic view change — https://www.youtube.com/watch?v=Wii1LX_ltIs

[2] James Cowling on determinism, working with Barbara Liskov — https://www.youtube.com/watch?v=ps106zjmjhw

[3] FoundationDB are pioneers of deterministic testing — https://www.youtube.com/watch?v=OJb8A6h9jQQ

[4] TigerBeetle's deterministic simulation tests — https://github.com/coilhq/tigerbeetle#simulation-tests

choeger4y ago

> nondeterministic (but predictable!)

Huh? How could a nondeterministic algorithm be predictable? Do you mean algorithms with a nondeterministic but overall irrelevant component ("pick a random element from this set")?

3 more replies

rubiquity4y ago

This was a great post and covers many day to day topics that practitioners tend to hand wave over, especially as distributed systems are becoming more pervasive. Dare I say even some of the statements are becoming cliches.

The section on distributed transactions could have a little more nuance. Particularly the example about the counter where I suspect any system offering transactions also has a CAS operation. Additionally the benefit of a transaction system is that you can offer bounded counters where as an AP or “strong” EC (CRDTs) system cannot.

jolynchOP4y ago

Thanks I'm glad you liked it! Your point on distributed transactions is very true, using CAS is what I meant by "transactionally advance a summary".

  For example, you could place a unique identifier on every count event and then roll 
  up those deltas in the background and transactionally advance a summary, either 
  preventing ingestion after some time delay or handling recounting.

Certainly transactions can help, but you still have to data model correctly for failure.

rubiquity4y ago

Thank you for pointing that out. I misinterpreted that bit to mean something less accurate than CAS.

fulafel4y ago

One disadvantage of people very familiar with distributed systems is that they might not try hard enough to avoid building unnecessarily distributed systems.

polio4y ago

Beyond the pedantic distinction, is there any real point to not calling "at-least-once delivery with idempotent processing" exactly-once processing? I can't imagine that any external observers would be able to tell.

jolynchOP4y ago

It conveys a false sense of correctness. Usually the system doing the processing has to use higher level or external methods of providing idempotency.

For example TCP implements "exactly once processing" by your definition but you probably still want Stripe to include idempotency keys in their charge API so you don't pay twice.

SpicyLemonZest4y ago

It’s a problem for those of us who build consumer APIs, because people without a deep distributed systems background quite reasonably expect that “exactly once” means their downstream application will receive each event exactly once. I’ve had multiple incredibly frustrating conversations where I had to find a diplomatic way to explain that yes, your connector needs to handle idempotency, and no, it’s not because there’s some “exactly once” technology our system is missing.

doctor_eval4y ago

Yeah, 100%. I've had a tier-one telco complain to me - very seriously - that my API was too complex because I wanted them to include a user-generated UUID in an API call, so we could ensure we only processed it once. It was very frustrating to be told that we're making things too complex when the whole point was to make it more reliable. A lot of people - including telco software engineers, it turns out - just treat computers as magical.

joebob424y ago

Am I missing why a distributed lock is an impossibility? The problem stated is that a partitioned node can't know it has lost the lock, but this is only an issue if there is a way to lose the lock short of returning it.

Which I guess is to say: what difference is there between a lease with an infinite timeout unless manually returned, and a "lock"?

Certainly the system deadlocks under partition but I'm not sure why that makes this "impossible".

evil-olive4y ago

> a lease with an infinite timeout unless manually returned

I would argue that "infinite timeout" is another negative shibboleth.

every operation in a distributed system has some duration after which you can be 99.9% confident (or 99.9999%, or whatever threshold you want to pick) that it was lost to the void and will never return a result.

in a robust distributed system, you want to pick a reasonable timeout value, and then take appropriate action in response to the timeout. typically this is retrying the operation, bubbling up a failure message to a higher level, or some combination of the two (retry a few times, fail if all the retries fail).

an infinite timeout represents a deliberate design choice of "I don't want to handle the case of this message or API call being lost in-transit and never returning either success or failure".

in my experience, infinite timeouts are often the cause of "hmm, this thing is up and running but seems 'stuck' and not making any progress, let me try manually restarting this service...OK, that seems to have recovered it" bugs and production alerts.

mjevans4y ago

I agree; it's also one of the few issues I have with Linux (and possibly Unix generally).

Zombie processes (dependent on some lock that will never clear) shouldn't be possible. At the very least abort (kill -9) should always be possible.

Failure should always be an option; it should be the default assumption. All other order must be wrested from that chaos.

1 more reply

macintux4y ago

Pretty sure it was Joe Armstrong who said he would take code and change infinite timeouts to 30 years and wait for the developers to protest.

junon4y ago

Yes - and further specify a TTL on ALL non-persistent (i.e. cache) data is a good rule of thumb.

rubiquity4y ago

The short answer is because if the lock holder fails your other nodes have no way of knowing if the lock holder failed (consequence of FLP Impossibility result). If you set a timeout, then that’s a lease.

The long answer is to peel this onion for yourself and see where it leads. It’s a lot of fun.

derefr4y ago

Why does it matter (in fully-general theory, which is what we're discussing here) if the lock holder fails? The lock is either released, at which point someone else can acquire it; or never released, at which point the system doesn't try to do whatever that lock is about any more. Assuming that every distributed system has to successfully make progress in all cases is just that—an assumption. A design could require that operations "fail stalled." Like a physical padlock that is "tamper proof" in the sense that it permanently seizes up when you put in the wrong key.

3 more replies

kgeist4y ago

At our job someone decided to use a ready-to-use Go library which used Redis for distributed locking. But I found that it was broken by design and completely unreliable, and we had random transient errors stemming from it. It worked OK 99.9% the time, but once in a while we were getting inconsistent state in our application. The description initially made sense and the usage looked simple. It worked by a node creating a value with a TTL, which was used to make the lock auto-expire if a node crashed. If a node found that a value under the same name was already found in Redis, it would block. Since access to Redis is serialized, all such actions were basically atomic. The problem was due to the auto-expire feature. The TTL can expire while your code under the lock is scheduled out due to GC or waiting for I/O. So the lock that you held could be released basically at any point of execution while you were supposedly under the lock. Extending the lock's TTL after every line of code isn't practical and probably prone to race conditions anyway (and the library IIRC didn't provide a way to do it). I read there's a technique called token fencing but it requires additional changes to the way your shared resources are accessed which isn't always possible. I still don't know how to do distributed locks right and there seem to be many broken implementations in the wild.

tonyg4y ago

So Redis isn't really a distributed locking system. The locks are all managed by a central, non-distributed server: Redis. But this kind of lock is useful too. One nice approach to handling crashes in a system like this is to use the idea of fate sharing [1]: you upper-bound the lifetime of a held lock by the lifetime of the TCP connection it was taken on. When the connection goes, the lock is auto-released. To support this, Redis would have to have some kind of EXPIRE_ON_DISCONNECT command - I don't know if it does or not.

The idea of fate sharing is very general and useful: you can, for example, introduce reconnectable sessions, and attach shared state to those, which gets you transport-independence and the ability to recover from transport failure.

[1] Clark, David D. “The Design Philosophy of the DARPA Internet Protocols.” ACM SIGCOMM Computer Communication Review 18, no. 4 (August 1988): 106–14. https://doi.org/10.1145/52325.52336.

1 more reply

qlm4y ago

I think it's possible to build such a lock, but if your lock will inevitably deadlock due to random failures it's probably not that useful.

convolvatron4y ago

replace "impossible" with "not really useful"

junon4y ago

This article is one of the few I've ever read on distributed systems designs that is even remotely close to what I'd call "correct". The amount of disagreements I've had with colleagues when I propose crash-only modes of operation is unbelievable.

Follow the guidelines in this post and you'll indeed result in (more) robust systems. Great writeup.

dochtman4y ago

Another shibboleth that I was expecting: backpressure.

_vvhw4y ago

Loved this!

And here are a few more positive phrases: "total order", "committed/uncommitted log", "replicated state machine", "logical timestamp", "the network is not homogeneous", "bandwidth is not infinite", "tail latency tolerance", "fault model", "message passing", "recovery", "cascading failure", "metastable", "strict serializability", and "stable storage"!

Surprisingly, at least to me, it's jarring to hear the phrase "CAP" because the field is far more nuanced. That "C" alone has so many different flavors! Far more interesting to talk about the FLP result.

Watch out also for "RPC" because that's not isomorphic with optimal consensus protocols, where messages follow multi-path routing (A->C->B->A). Granted, RAFT uses the term "RPC" (A->B->A) but that's more a concession to accessibility and not the broadest way to think of distributed systems, or necessarily the best way to implement them. Message passing is simpler and more realistic as it helps you to see and think about the network/process fault model more clearly.

Distributed testing techniques are also moving towards autonomous deterministic testing, as pioneered by FoundationDB, where the database itself is the test simulator—these tend to get into more interesting state spaces that can also be replayed instantly from a seed, compared to external test harnesses that run for hours in real time and that can't reproduce distributed bugs deterministically in exactly the same way every time.

andrewflnr4y ago

> database vendors might try just a little harder to tell the truth...

Come on, you know that's not what's going to happen. If they notice at all, they'll just incorporate the magic phrases into their BS so you have to hunt harder for a real signal.

jolynchOP4y ago

Oh shoot I forgot Shibboleths have to remain secret, I have made a terrible mistake.

curious_cat_1634y ago

Is there a ‘standard’ way to test the positive shibboleths for existence?

I am not necessarily thinking just tests running as code. Although that would be nice.

1 more reply

andrewflnr4y ago

Indeed, you have doomed us all.

Less /s, I'd be more worried if I thought the salescritters were listening, but they're mostly not.

lopatin4y ago

I'm a newbie and a little confused. On one hand there are posts like this that claim exactly-once delivery and distributed locks are impossible. But on the other hand, if I look at the docs of a distributed database, say Apache Ignite, they will say that they have exactly-once delivery [1] and distributed locks [2]. So ... which is it?

[1] https://ignite.apache.org/docs/latest/key-value-api/continuo...

[2] https://ignite.apache.org/docs/latest/distributed-locks

jolynchOP4y ago

This is exactly the kind of stuff I wrote this post about. The first (exactly once) is actually just at-least-once with deduplication based on a counter. To reliably process the events, however, you need to make your downstream idempotent as well. Think of it like your event processor might fail, so even if you only receive the message "once" if you can fail processing it you still have to think about retries. In my opinion it would be explicitly better for the event system to provide "at least once" and intentionally duplicate events on occasion to test your processors ability to handle duplicates.

The second (lock) is actually a lease fwict, and writing code in the locked body that assumes it will be truly mutually exclusive is pretty dangerous (see the linked post from Martin [1] for why).

[1] https://martin.kleppmann.com/2016/02/08/how-to-do-distribute...

lopatin4y ago

> In my opinion it would be explicitly better for the event system to provide "at least once" and intentionally duplicate events on occasion to test your processors ability to handle duplicates.

You're a madman, but I think you're right.

Thanks for the locking link.

nosequel4y ago

Software vendors lie. Even open source ones. In some cases they get REALLY close to the truth, but aren't forthcoming with that last 1% case where their statement becomes a lie.

If you want to see a ton of lies, go and start reading through all the Jepsen posts tearing down those dubious claims.

http://jepsen.io/analyses

Ozzie_osman4y ago

Great post!

A few more positive shiboleths.

One would be eventual consistency.

Another would be discussing write paths vs read paths (or patterns) and recognizing that those can be decoupled (or a mention of CQRS).

sombremesa4y ago

> One would be eventual consistency.

In the vein of the TFA, this would be a negative shibboleth (which just goes to show how being pedantic in certain contexts is just silly). The reason being that eventual consistency has no guarantee on what "eventual" means. If your replicas converge ten years from when a change is made, you can (correctly) claim to have eventual consistency.

HelloNurse4y ago

If consistency is "eventual" it usually means that the application doesn't care about inconsistent states enough to make them consistent, making the whole architecture luck-oriented.

Eventual consistency can be a rational choice or a quasi-necessity, but in practice it's a shortcut for careless optimists; I wouldn't expect them to analyze how inconsistent states can go wrong, whether they are an acceptable risk, and how to deal with abnormal situations.

bmcniel4y ago

It is notable that the linked article [1] is critical of those that don't understand these concepts, then goes on to misunderstand some of the concepts.

[1] https://codahale.com/you-cant-sacrifice-partition-tolerance/

fortran774y ago

For those curious, here's the origin of the word "Shibboleth"

https://www.sefaria.org/Judges.12.6?ven=Tanakh:_The_Holy_Scr...

furstenheim4y ago

I liked the article a lot.

I prefer the term retryable to idempotent. If there's a failure in the first call, to be truly idempotent it should fail on the second.

Retryable on the other hand is easier to argue about. Important thing is not the response but the end state of the system.

lupire4y ago

Idempotent means that repeating the same operation eventually stabilizes, not that Op^2 = Op. It combines retryability of failures with retryability of successes.

Alternatively, idempotency applies to successful operations, orthogonal from error cases.

Retryability doesn't help in the case of the (bad) operation "+1" which is not idempotent.

wnoise4y ago

Idempotent means exactly Op^2 = Op. If you want a word that means those other things, go find another, this one has long been taken.

j / k navigate · click thread line to collapse

66 comments

cpitman4y ago

Great post. This one always brings a smile to my face:

> Every component is crash-only

WJW4y ago

Rebrand it as "fault tolerant" and/or "adverse environment certified" and you should be good to go. That's how they do it in the military sector at least.

__alexs4y ago

It's not a crash it's a runtime state rollback.

caffeine4y ago

> They interpreted crashing as bad quality

The solution here is to rebrand it with some vague euphemism:

“Ah yes the component underwent a state calibration”

fh9734y ago

The term you're looking for is "software rejuvenation".

Jokes aside, there is even a body of research papers around this subject, if you need some backing.

quickthrower24y ago

I guess you were working for a stock exchange

mjb4y ago

I enjoyed reading that a lot.

> The main advantage of distributed transactions is that they make distributed systems look less distributed by choosing CP, but that inherently trades off availability!

jolynchOP4y ago

I do think CRDTs or idempotency/fencing tokens are also a valuable way to reason about state transitions, and they can provide much lower latency in a global distributed system.

_vvhw4y ago

> Allowing developers to think about a stream of operations that moves the databases from one legal state to another is super powerful.

You're right—it's super powerful.

Here's TB's state checker in 49 lines of Zig:

https://github.com/coilhq/tigerbeetle/blob/477d6df366e2c10fa...

User234y ago

_vvhw4y ago

Surprisingly, some of the most powerful distributed systems algorithms or tools are actually deterministic.

Determinism makes all this possible.

Deterministic testing [3][4] is also your best friend when it comes to testing distributed systems.

[1] An introductory talk on VSR and it's deterministic view change — https://www.youtube.com/watch?v=Wii1LX_ltIs

[2] James Cowling on determinism, working with Barbara Liskov — https://www.youtube.com/watch?v=ps106zjmjhw

[3] FoundationDB are pioneers of deterministic testing — https://www.youtube.com/watch?v=OJb8A6h9jQQ

[4] TigerBeetle's deterministic simulation tests — https://github.com/coilhq/tigerbeetle#simulation-tests

choeger4y ago

> nondeterministic (but predictable!)

Huh? How could a nondeterministic algorithm be predictable? Do you mean algorithms with a nondeterministic but overall irrelevant component ("pick a random element from this set")?

3 more replies

rubiquity4y ago

jolynchOP4y ago

Thanks I'm glad you liked it! Your point on distributed transactions is very true, using CAS is what I meant by "transactionally advance a summary".

  For example, you could place a unique identifier on every count event and then roll 
  up those deltas in the background and transactionally advance a summary, either 
  preventing ingestion after some time delay or handling recounting.

Certainly transactions can help, but you still have to data model correctly for failure.

rubiquity4y ago

Thank you for pointing that out. I misinterpreted that bit to mean something less accurate than CAS.

fulafel4y ago

One disadvantage of people very familiar with distributed systems is that they might not try hard enough to avoid building unnecessarily distributed systems.

polio4y ago

jolynchOP4y ago

It conveys a false sense of correctness. Usually the system doing the processing has to use higher level or external methods of providing idempotency.

For example TCP implements "exactly once processing" by your definition but you probably still want Stripe to include idempotency keys in their charge API so you don't pay twice.

SpicyLemonZest4y ago

doctor_eval4y ago

joebob424y ago

Which I guess is to say: what difference is there between a lease with an infinite timeout unless manually returned, and a "lock"?

Certainly the system deadlocks under partition but I'm not sure why that makes this "impossible".

evil-olive4y ago

> a lease with an infinite timeout unless manually returned

I would argue that "infinite timeout" is another negative shibboleth.

an infinite timeout represents a deliberate design choice of "I don't want to handle the case of this message or API call being lost in-transit and never returning either success or failure".

mjevans4y ago

I agree; it's also one of the few issues I have with Linux (and possibly Unix generally).

Zombie processes (dependent on some lock that will never clear) shouldn't be possible. At the very least abort (kill -9) should always be possible.

Failure should always be an option; it should be the default assumption. All other order must be wrested from that chaos.

1 more reply

macintux4y ago

Pretty sure it was Joe Armstrong who said he would take code and change infinite timeouts to 30 years and wait for the developers to protest.

junon4y ago

Yes - and further specify a TTL on ALL non-persistent (i.e. cache) data is a good rule of thumb.

rubiquity4y ago

The long answer is to peel this onion for yourself and see where it leads. It’s a lot of fun.

derefr4y ago

3 more replies

kgeist4y ago

tonyg4y ago

[1] Clark, David D. “The Design Philosophy of the DARPA Internet Protocols.” ACM SIGCOMM Computer Communication Review 18, no. 4 (August 1988): 106–14. https://doi.org/10.1145/52325.52336.

1 more reply

qlm4y ago

I think it's possible to build such a lock, but if your lock will inevitably deadlock due to random failures it's probably not that useful.

convolvatron4y ago

replace "impossible" with "not really useful"

junon4y ago

Follow the guidelines in this post and you'll indeed result in (more) robust systems. Great writeup.

dochtman4y ago

Another shibboleth that I was expecting: backpressure.

_vvhw4y ago

Loved this!

andrewflnr4y ago

> database vendors might try just a little harder to tell the truth...

Come on, you know that's not what's going to happen. If they notice at all, they'll just incorporate the magic phrases into their BS so you have to hunt harder for a real signal.

jolynchOP4y ago

Oh shoot I forgot Shibboleths have to remain secret, I have made a terrible mistake.

curious_cat_1634y ago

Is there a ‘standard’ way to test the positive shibboleths for existence?

I am not necessarily thinking just tests running as code. Although that would be nice.

1 more reply

andrewflnr4y ago

Indeed, you have doomed us all.

Less /s, I'd be more worried if I thought the salescritters were listening, but they're mostly not.

lopatin4y ago

[1] https://ignite.apache.org/docs/latest/key-value-api/continuo...

[2] https://ignite.apache.org/docs/latest/distributed-locks

jolynchOP4y ago

The second (lock) is actually a lease fwict, and writing code in the locked body that assumes it will be truly mutually exclusive is pretty dangerous (see the linked post from Martin [1] for why).

[1] https://martin.kleppmann.com/2016/02/08/how-to-do-distribute...

lopatin4y ago

> In my opinion it would be explicitly better for the event system to provide "at least once" and intentionally duplicate events on occasion to test your processors ability to handle duplicates.

You're a madman, but I think you're right.

Thanks for the locking link.

nosequel4y ago

Software vendors lie. Even open source ones. In some cases they get REALLY close to the truth, but aren't forthcoming with that last 1% case where their statement becomes a lie.

If you want to see a ton of lies, go and start reading through all the Jepsen posts tearing down those dubious claims.

http://jepsen.io/analyses

Ozzie_osman4y ago

Great post!

A few more positive shiboleths.

One would be eventual consistency.

Another would be discussing write paths vs read paths (or patterns) and recognizing that those can be decoupled (or a mention of CQRS).

sombremesa4y ago

> One would be eventual consistency.

HelloNurse4y ago

If consistency is "eventual" it usually means that the application doesn't care about inconsistent states enough to make them consistent, making the whole architecture luck-oriented.

bmcniel4y ago

It is notable that the linked article [1] is critical of those that don't understand these concepts, then goes on to misunderstand some of the concepts.

[1] https://codahale.com/you-cant-sacrifice-partition-tolerance/

fortran774y ago

For those curious, here's the origin of the word "Shibboleth"

https://www.sefaria.org/Judges.12.6?ven=Tanakh:_The_Holy_Scr...

furstenheim4y ago

I liked the article a lot.

I prefer the term retryable to idempotent. If there's a failure in the first call, to be truly idempotent it should fail on the second.

Retryable on the other hand is easier to argue about. Important thing is not the response but the end state of the system.

lupire4y ago

Idempotent means that repeating the same operation eventually stabilizes, not that Op^2 = Op. It combines retryability of failures with retryability of successes.

Alternatively, idempotency applies to successful operations, orthogonal from error cases.

Retryability doesn't help in the case of the (bad) operation "+1" which is not idempotent.

wnoise4y ago

Idempotent means exactly Op^2 = Op. If you want a word that means those other things, go find another, this one has long been taken.

j / k navigate · click thread line to collapse