> There is one rule, pretty much the only rule, that determines success in scaling data systems: avoid coordination whenever possible. This one rule neatly summarizes my entire career until this point --- the quarter century I’ve spent teaching, innovating, and building scalable data systems. At the end of the day it all comes down to one thing --- avoiding coordination.
At one of my previous jobs there was an inventory management service split into two layers: one service that tracked the inventory at a particular warehouse (including location and other specifics) and another that tracked the overall inventory across the entire system (without those specifics).
Whenever inventory had to be allocated, the global and local inventory systems had to coordinate. Easy enough to do with optimistic locking and compensation actions, but it became a performance bottleneck pretty quickly.
Replacing that system with an immutable ledger that gets lazily materialized improved throughput dramatically. You obviously had to handle the cases where a stale materialization meant you'd overcommitted inventory, but that was easy enough to do, and it let us scale the peak order processing rate by 6x.
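A rough sketch of the ledger shape, assuming the simplest possible version (my own `record`/`materialize`/`overcommitted` names, a single in-memory list standing in for a durable log):

```python
from collections import defaultdict

# Hypothetical sketch: allocations append to an immutable ledger with no
# coordination; a materializer folds the ledger into counts lazily, and
# negative balances surface as overcommits to be compensated afterwards.

ledger = []  # append-only: (sku, delta)

def record(sku, delta):
    ledger.append((sku, delta))   # no locking, no cross-service call

def materialize():
    counts = defaultdict(int)
    for sku, delta in ledger:
        counts[sku] += delta
    return counts

def overcommitted():
    # Writers that raced on a stale snapshot may have taken the balance
    # negative; those get fixed up after the fact.
    return {sku: c for sku, c in materialize().items() if c < 0}

record("widget", +5)      # receive stock
record("widget", -3)      # allocation
record("widget", -4)      # raced allocation based on a stale snapshot
print(overcommitted())    # {'widget': -2} -> trigger compensation
```

The write path needs no agreement at all; coordination is pushed into an occasional, off-the-hot-path reconciliation step, which is why throughput improves so much.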
I've seen some stream processing systems follow the partitioned data and apply all transformations at once (e.g. Kafka Streams), while others parallelize across the transformations themselves (e.g. Apache Storm, IIRC).
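The contrast between the two shapes can be sketched in a few lines; this is my own toy illustration, not how either Kafka Streams or Storm is actually implemented:

```python
# Hypothetical contrast: "depth-first" pushes each record through the
# entire pipeline before touching the next record, while "breadth-first"
# runs each transform over all records, stage by stage (the stages are
# what you'd hand to separate parallel workers).

transforms = [lambda x: x + 1, lambda x: x * 2]

def depth_first(records):
    out = []
    for r in records:                   # one record at a time...
        for t in transforms:            # ...through the whole pipeline
            r = t(r)
        out.append(r)
    return out

def breadth_first(records):
    for t in transforms:                # one stage at a time...
        records = [t(r) for r in records]   # ...over all the records
    return records

print(depth_first([1, 2, 3]))    # [4, 6, 8]
print(breadth_first([1, 2, 3]))  # [4, 6, 8]
```

Same results, but very different failure and communication characteristics: the staged form hands data between workers at every stage boundary, while the depth-first form keeps each record's work local.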
Also, isn't there a tradeoff in that, under a depth-first (for lack of a better term) processing paradigm, error recovery becomes more costly?
Usually the operations that force a local pipeline to end occur in a query plan well before you'd hit any tradeoff from doing too much in a local pipeline, since local pipelines are SO much faster than anything that requires communication.