No, I got you.
Like I said, S3 is high-throughput, high-latency storage. When you fetch an S3 object to disk, that's a high-throughput operation and S3 excels at it. Once the data is on disk, DuckDB can operate at low latency.
If you run DuckDB end to end as a database engine directly against S3, it has to do partial reads on Parquet files over the network, pay S3 request latencies on each one, and it can end up slower than the fetch-then-query approach above.
For long-running operations where I can chunk the data, I often copy chunks to local disk before running DuckDB. It's a lot faster than running DuckDB directly on S3.
The downside is that I need enough local disk space for the chunks.