Still, it's no big deal. You can cache a huge number of Parquet headers/footers on disk or in memory and get 90% of the performance (or better, honestly).
Min/max stats for a column are a huge win because I pre-encode any low-cardinality strings as integers. That means I can skip entire row groups without ever touching S3, using just the footer information; and if the footer isn't cached, I can fetch it alone and still skip decoding anything that can't contain my data.
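For anyone curious, here's a rough sketch of what that skip looks like with pyarrow. The column name and the string-to-int dictionary are hypothetical stand-ins, not their actual schema:

```python
import pyarrow.parquet as pq

# Hypothetical dictionary built at write time: low-cardinality strings -> ints.
REGION_CODES = {"us-east": 0, "us-west": 1, "eu": 2}

def row_groups_to_read(footer_source, region: str) -> list[int]:
    """Pick row groups using only footer metadata (min/max stats).

    Because 'region_code' was pre-encoded as an integer column, the
    min/max comparison is cheap and exact -- no string-ordering surprises.
    """
    code = REGION_CODES[region]
    meta = pq.read_metadata(footer_source)  # parses just the footer
    col_idx = meta.schema.names.index("region_code")
    keep = []
    for rg in range(meta.num_row_groups):
        stats = meta.row_group(rg).column(col_idx).statistics
        if stats is None or not stats.has_min_max or stats.min <= code <= stats.max:
            keep.append(rg)  # might contain our value
        # else: provably absent -- skip without touching the data pages
    return keep
```

From there, something like `pq.ParquetFile(path).read_row_groups(keep)` pulls only the surviving groups.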
Footers can get fairly large in absolute terms, tens to hundreds of KB for a very large file. But that's still tiny next to a multi-GB Parquet file, and the bytes compress extremely well for a second- or third-tier cache. You can hold 1000s of these pre-parsed in memory no problem, and store 10s of thousands more on disk.
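A minimal sketch of that tiering, assuming the footer bytes (including the trailing length + "PAR1" magic) have already been fetched. The cache sizes, zlib, and the class itself are my choices for illustration, not their implementation:

```python
import io
import zlib
from collections import OrderedDict
from pathlib import Path

import pyarrow.parquet as pq

class FooterCache:
    """Tier 1: pre-parsed FileMetaData in memory (LRU).
    Tier 2: zlib-compressed raw footer bytes on disk.
    Footers are Thrift-encoded and full of repeated strings,
    so they compress very well."""

    def __init__(self, disk_dir: str, max_in_memory: int = 4096):
        self.mem: OrderedDict[str, pq.FileMetaData] = OrderedDict()
        self.max_in_memory = max_in_memory
        self.disk = Path(disk_dir)
        self.disk.mkdir(parents=True, exist_ok=True)

    def put(self, key: str, footer_bytes: bytes) -> pq.FileMetaData:
        # footer_bytes must end with the 4-byte length + b"PAR1" so
        # pyarrow can parse the slice as if it were the end of a file.
        (self.disk / key).write_bytes(zlib.compress(footer_bytes, 6))
        meta = pq.read_metadata(io.BytesIO(footer_bytes))
        self._remember(key, meta)
        return meta

    def get(self, key: str) -> pq.FileMetaData | None:
        if key in self.mem:                      # tier 1 hit
            self.mem.move_to_end(key)
            return self.mem[key]
        path = self.disk / key
        if path.exists():                        # tier 2 hit: decompress + reparse
            raw = zlib.decompress(path.read_bytes())
            meta = pq.read_metadata(io.BytesIO(raw))
            self._remember(key, meta)
            return meta
        return None                              # miss: caller fetches from S3

    def _remember(self, key: str, meta: pq.FileMetaData) -> None:
        self.mem[key] = meta
        self.mem.move_to_end(key)
        if len(self.mem) > self.max_in_memory:
            self.mem.popitem(last=False)         # evict least-recently-used
```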
I've spent zero time optimizing my footers so far. They could certainly get smaller: I know my own custom metadata overlaps with the existing Parquet stats, and I just haven't bothered to deduplicate it. Honestly there are a bunch of layout optimizations I've yet to explore. Writing metadata into headers would have obvious benefits (streaming), whereas right now I do a sort of "grab chunks off the end of the file until we find the footer" dance, sketched below. But it rarely comes up because... caching. And there are worse things than a few spurious RANGE requests.
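That dance follows directly from the Parquet file layout (the footer ends with a 4-byte little-endian metadata length plus the "PAR1" magic), so it's at most two ranged reads. This is a generic sketch, with fetch_range as a stand-in for an S3 ranged GET; the chunk size and names are mine:

```python
import struct
from typing import Callable

MAGIC = b"PAR1"
FOOTER_TAIL = 8            # 4-byte footer length + 4-byte magic
INITIAL_GUESS = 64 * 1024  # grab this much off the end first

def fetch_footer(file_size: int,
                 fetch_range: Callable[[int, int], bytes]) -> bytes:
    """Return the raw footer (thrift metadata + length + magic).

    fetch_range(start, length) stands in for a ranged GET.
    Worst case this costs two RANGE requests; usually one.
    """
    guess = min(INITIAL_GUESS, file_size)
    tail = fetch_range(file_size - guess, guess)
    if tail[-4:] != MAGIC:
        raise ValueError("not a Parquet file (missing trailing PAR1)")
    (footer_len,) = struct.unpack("<I", tail[-8:-4])
    total = footer_len + FOOTER_TAIL
    if total <= guess:
        return tail[-total:]          # first grab was enough
    # Footer was bigger than our guess: one more request for the rest.
    rest = fetch_range(file_size - total, total - guess)
    return rest + tail
```

With boto3, fetch_range is just a GetObject with a Range header (e.g. Range=f"bytes={start}-{start + length - 1}").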