I don't know much about DB internals, but to me this sounds like a lot of the compute is getting delegated to the storage layer. I would think that filtering, aggregation, and projection are a fairly big chunk of the computation a typical DB does?
> Many SQL aggregations are monotonic operations (e.g. MAX, SUM, etc) that can be partially completed on each node and then post-merged. Some (e.g. DISTINCT) can be transformed into monotonic ops with some effort. Some aren't possible to do this way. (Ref on monotonicity: arxiv.org/pdf/1901.01930)
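The partial-compute-then-post-merge idea in the quote can be sketched in a few lines. This is a toy illustration, not any real engine's code: "shards" are just Python lists standing in for storage nodes, and the DISTINCT transformation is the usual set-union (or sketch-based) trick.

```python
# Toy sketch of partially-completed aggregates with a post-merge step.
# Each inner list stands in for the rows held by one storage node.
from functools import reduce

shards = [
    [3, 1, 4, 1, 5],   # rows on storage node 0
    [9, 2, 6, 5, 3],   # rows on storage node 1
    [5, 8, 9, 7, 9],   # rows on storage node 2
]

# Monotonic aggregates (MAX, SUM): each node computes a partial result
# close to the data; the query processor merges the small partials.
partial_maxes = [max(s) for s in shards]
final_max = max(partial_maxes)            # 9

partial_sums = [sum(s) for s in shards]
final_sum = sum(partial_sums)             # 77

# COUNT(DISTINCT x) isn't mergeable as plain counts, but it can be
# transformed: ship a per-node set (or a sketch like HyperLogLog),
# union the sets, then count once at the end.
partial_sets = [set(s) for s in shards]
final_distinct = len(reduce(set.union, partial_sets))  # 9

print(final_max, final_sum, final_distinct)
```

The key property is that the merge operates on small partial results rather than raw rows, so only the partials cross the network.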
The benefit of this is that a lot more work is done _close_ to the data. The trend is that bandwidth is getting larger in data centers, but latency isn't improving at the same rate. Reducing the number of round trips between QP and storage greatly improves overall query latency, even if you have to do more work on the storage side.
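A back-of-envelope calculation shows why fewer round trips can win even when the storage side does the per-row work less efficiently. All numbers here are made up for illustration, not measurements:

```python
# Hypothetical costs: round trips dominate when latency is high
# relative to per-row compute. Numbers are illustrative only.
rtt_ms = 0.5         # assumed QP <-> storage round-trip time
row_cost_ms = 0.01   # assumed per-row processing cost
rows = 100

# Plan A: QP pulls the rows over 10 round trips, aggregates itself.
plan_a_ms = 10 * rtt_ms + rows * row_cost_ms          # 5.0 + 1.0 = 6.0

# Plan B: storage aggregates locally and returns one partial result
# in a single round trip, even at 2x the per-row cost.
plan_b_ms = 1 * rtt_ms + rows * (2 * row_cost_ms)     # 0.5 + 2.0 = 2.5

print(plan_a_ms, plan_b_ms)
```

With these (assumed) numbers, doubling the per-row work on storage still more than pays for itself by eliminating nine round trips, which is the trade-off the comment is describing.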
But isn't that fundamentally at odds with the central idea of disaggregation?
> At a fundamental level, scaling compute in a database system requires disaggregation of storage and compute. If you stick storage and compute together, you end up needing to scale one to scale the other, which is either impossible or uneconomical.
So it seems you can either get good perf by doing the work close to the data, or good scalability by separating compute and data, but I can't see how you can do both.
See https://news.ycombinator.com/item?id=42308716 for more.