> There is one rule, pretty much the only rule, that determines success in scaling data systems: avoid coordination whenever possible. This one rule neatly summarizes my entire career until this point --- the quarter century I’ve spent teaching, innovating, and building scalable data systems. At the end of the day it all comes down to one thing --- avoiding coordination.
At one of my previous jobs there was an inventory management service split into two layers: one service that tracked the inventory at a particular warehouse (including location and other specifics) and another that tracked the overall inventory across the entire system (without those specifics).
Whenever inventory had to be allocated, the global and local inventory systems had to coordinate. Easy enough to do with optimistic locking and compensation actions, but it became a performance bottleneck pretty quickly.
Replacing that system with an immutable ledger that gets lazily materialized improved throughput dramatically. You obviously had to handle the cases where a stale materialization meant you'd overcommitted inventory, but that was easy enough to do, and it let us scale the peak order processing rate by 6x.
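A rough sketch of the ledger shape, assuming the simplest possible version (my own `record`/`materialize`/`overcommitted` names, a single in-memory list standing in for a durable log):

```python
from collections import defaultdict

# Hypothetical sketch: allocations append to an immutable ledger with no
# coordination; a materializer folds the ledger into counts lazily, and
# negative balances surface as overcommits to be compensated afterwards.

ledger = []  # append-only: (sku, delta)

def record(sku, delta):
    ledger.append((sku, delta))   # no locking, no cross-service call

def materialize():
    counts = defaultdict(int)
    for sku, delta in ledger:
        counts[sku] += delta
    return counts

def overcommitted():
    # Writers that raced on a stale snapshot may have taken the balance
    # negative; those get fixed up after the fact.
    return {sku: c for sku, c in materialize().items() if c < 0}

record("widget", +5)      # receive stock
record("widget", -3)      # allocation
record("widget", -4)      # raced allocation based on a stale snapshot
print(overcommitted())    # {'widget': -2} -> trigger compensation
```

The write path needs no agreement at all; coordination is pushed into an occasional, off-the-hot-path reconciliation step, which is why throughput improves so much.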
I've seen some stream processing systems follow the partitioned data and apply all transformations at once (e.g. Kafka Streams), while others parallelize across the transformations themselves (e.g. Apache Storm, IIRC).
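The contrast between the two shapes can be sketched in a few lines; this is my own toy illustration, not how either Kafka Streams or Storm is actually implemented:

```python
# Hypothetical contrast: "depth-first" pushes each record through the
# entire pipeline before touching the next record, while "breadth-first"
# runs each transform over all records, stage by stage (the stages are
# what you'd hand to separate parallel workers).

transforms = [lambda x: x + 1, lambda x: x * 2]

def depth_first(records):
    out = []
    for r in records:                   # one record at a time...
        for t in transforms:            # ...through the whole pipeline
            r = t(r)
        out.append(r)
    return out

def breadth_first(records):
    for t in transforms:                # one stage at a time...
        records = [t(r) for r in records]   # ...over all the records
    return records

print(depth_first([1, 2, 3]))    # [4, 6, 8]
print(breadth_first([1, 2, 3]))  # [4, 6, 8]
```

Same results, but very different failure and communication characteristics: the staged form hands data between workers at every stage boundary, while the depth-first form keeps each record's work local.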
Also, isn't there a tradeoff in that, under a depth-first (for lack of a better term) processing paradigm, error recovery becomes more costly?
Usually the operations that force a local pipeline to end occur in a query plan well before you'd hit any tradeoff from doing too much in a local pipeline, since local pipelines are SO much faster than anything that requires communication.