Even within centralized, monolithic systems with no outside interaction, the laws of physics and reality still apply: a pulled power cord, an earthquake, a flood, or any of a myriad other things may interrupt message processing, so a message may end up processed zero, one, or more times, even if it were being processed within the confines of an embedded microcontroller running hand-crafted assembler code.
In critical sectors like banking, finance, and payments, designing systems in a well-understood, boring manner is absolutely critical, while hoping for the best based on shiny brochures and marketecture is a sure recipe for disaster.
"Exactly-once semantics" is just that: semantics. It's at-least-once plus idempotency, which the system may or may not actually be able to guarantee depending on implementation details that the marketing fluff will invariably leave out if it says "exactly once". And that's a major problem when relying on such 'semantics'.
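To make the "at-least-once plus idempotency" point concrete, here is a minimal sketch (all names like `Message` and `IdempotentConsumer` are illustrative, not from any library): the transport may redeliver, and the consumer deduplicates by a producer-assigned message ID so the effect happens once.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Message:
    msg_id: str   # unique ID assigned by the producer
    amount: int

class IdempotentConsumer:
    def __init__(self) -> None:
        self.processed: set[str] = set()   # would be a durable store in a real system
        self.balance = 0

    def process(self, msg: Message) -> None:
        if msg.msg_id in self.processed:   # duplicate delivery: drop it
            return
        self.balance += msg.amount         # the actual effect
        self.processed.add(msg.msg_id)     # must commit atomically with the effect

c = IdempotentConsumer()
m = Message("tx-1", 100)
c.process(m)
c.process(m)   # redelivery after a retry/crash: no double-charge
```

The catch the marketing leaves out is the comment on the last line of `process`: if recording the ID and applying the effect are not atomic, a crash between them reintroduces duplicates or loss.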
I bet analyzing those messages using this framework would reveal exactly why compensating actions are needed.
The takeaway I am getting from this paper is that you're not just looking for idempotency. Monotonicity is important, as well as commutativity and associativity. If the computation cannot be expressed that way, then coordination is required.
There are additional efficiencies if the operators are commutative and associative in addition to being idempotent.
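A tiny demonstration of why those three properties matter (a sketch, using plain set union as the example operator): because union is commutative, associative, and idempotent, folding it over event batches gives the same result regardless of delivery order or duplicate deliveries, which is exactly why no coordination is needed.

```python
from functools import reduce

def merge(a: frozenset, b: frozenset) -> frozenset:
    return a | b   # set union: commutative, associative, idempotent

batches = [frozenset({1, 2}), frozenset({2, 3}), frozenset({3, 4})]

forward  = reduce(merge, batches)                    # in-order delivery
backward = reduce(merge, list(reversed(batches)))    # reordered delivery
with_dup = reduce(merge, batches + [batches[0]])     # duplicate delivery

assert forward == backward == with_dup == frozenset({1, 2, 3, 4})
```

Drop any one property and the guarantee breaks: addition is commutative and associative but not idempotent, so duplicates corrupt the result; list append is none of the three, so even reordering does.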
It's now relatively simple for a developer to implement a system with an exactly-once guarantee, as long as you take care of the world that is not inside a Kafka transaction (integrations with third parties and such). That is still not super easy sometimes, but it's easier than the distributed transaction that happens inside Kafka.
Kafka hides the complexity really well; in my use of it so far, it has been very reliable with the "new" semantics.
Everybody and their uncles seem to have picked up on this trend of advertising at-least-once systems as exactly-once, then burying somewhere in the docs that you're expected to guarantee idempotency yourself to get the appearance of exactly-once. That was the state of the art decades ago, it's the state of the art now, and it's pretty damn dishonest to sell quality-of-life improvements as a fundamental shift in the guarantees/properties of these systems.
What baffles me even more is why the above is apparently not generally considered good enough or elegant enough, when, as a bonus, it doesn't violate the laws of physics either. Both sides of the coin are quite tameable and implementable, and together they deliver what was wanted in the first place, effectively. Curious about any subtle edge cases I might have missed here!
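The "both sides of the coin" can be sketched end to end in a few lines (a toy model, with an in-memory list standing in for the broker's log, and a simulated lost-ack): the producer retries until acknowledged, which yields at-least-once delivery and possible duplicates in the log, and the consumer applies a keyed write so replaying the log is idempotent.

```python
import random

random.seed(7)                           # make the simulated ack losses deterministic

log: list[tuple[str, str]] = []          # stand-in for the broker/replayable log
applied: dict[str, str] = {}             # consumer's keyed state

def send_with_retries(key: str, value: str) -> None:
    """Producer side: retry until acknowledged (at-least-once)."""
    while True:
        log.append((key, value))         # the send itself
        if random.random() < 0.5:        # simulated lost ack: retry => duplicate in log
            continue
        return

def consume() -> None:
    """Consumer side: keyed write, so replay and duplicates are harmless."""
    for key, value in log:
        applied[key] = value

send_with_retries("order-42", "paid")
consume()
consume()                                # full replay: same final state
assert applied == {"order-42": "paid"}
```

The final state is the same no matter how many duplicates land in the log or how many times the log is replayed, which is the observable behavior people mean by "exactly-once".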
Databases have been handling exactly-once semantics for decades now. What Kafka is doing is not new and actually gives you a false sense of security when it comes to these kinds of things.
What Kafka supports is exactly-once processing, which had been supported in other stream processing frameworks such as Apache Storm years before Confluent's marketing. Duplicates are still possible in Kafka with the current implementation of exactly-once; if one uses Kafka's consumer API, it will de-dupe on the processing side.
So no, there is no such thing as exactly-once in distributed systems.
The idea of a replayable log is that it can convert a disordered sequence of events into something ordered. The Bloom(L) work, by contrast, constructs algorithms that require only a partial order. An event stream can be processed out of order because the functions being used are monotonic and the data structures are composed with operators that are commutative, associative, and idempotent. (Thus there is no requirement for an exactly-once guarantee or an ordered event stream.)
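A small illustration of that Bloom(L)-style claim (a sketch; `apply_event` and pointwise `max` as the merge are my illustrative choices, not from the paper): because `max` is commutative, associative, and idempotent, folding the event stream in every possible order, duplicates included, converges to one state, so neither total ordering nor exactly-once delivery is required.

```python
from itertools import permutations

def apply_event(state: dict[str, int], event: tuple[str, int]) -> dict[str, int]:
    key, value = event
    new = dict(state)
    new[key] = max(new.get(key, 0), value)   # max is commutative/associative/idempotent
    return new

events = [("a", 3), ("b", 1), ("a", 5), ("b", 1)]   # note the duplicate ("b", 1)

results = set()
for order in permutations(events):           # every possible delivery order
    state: dict[str, int] = {}
    for e in order:
        state = apply_event(state, e)
    results.add(frozenset(state.items()))

assert results == {frozenset({("a", 5), ("b", 1)})}   # one state, 24 orders
```

Replace `max` with `+` (not idempotent) and the duplicate breaks convergence; replace it with "last write wins" (not commutative) and the ordering does. That is the boundary where coordination, or a replayable ordered log, becomes necessary again.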
AFAIK nowhere in the Kafka documentation does it use the term "queue", and unless you have only one consumer per consumer group, it's impossible to guarantee FIFO behavior across a topic. Maybe call me a nitpicker, but I've seen this "queue" language lead to completely wrong assumptions about Kafka.