Before, I would reach for Apache Spark to run queries on local Parquet datasets (on a single machine), but I've started using DuckDB for that and it's super fast (much faster than Pandas) and unfussy to integrate into Python code (with PySpark you need all kinds of boilerplate).
DuckDB is so lightweight that it's also great for quick interactive work in Jupyter or IPython.
I also use it to do cross-format joins between Parquet (immutable) and CSV files (mutable) -- DuckDB can load both into the same environment -- which makes it easy to solve different kinds of programming problems. The dynamic stuff goes into the CSV files while the Parquet dataset remains static. For instance, when processing a large Parquet dataset, my code keeps track of groups that I've already processed in a CSV file. If the program is interrupted and I need to resume from where I left off, I just do a DuckDB join between the Parquet and the CSV and exclude the already-processed groups (particularly handy when you don't have a single group key and you're grouping by several fields). Yes, you can do all this with Spark too, but the DuckDB code is so much simpler and more compact.
For large Parquet datasets, I currently roll my own code to chunk them so I can process them out-of-core, but it sounds like this latest streaming feature in DuckDB takes care of that detail.
Sure, DuckDB doesn't do distributed compute like Spark, but as a SQL engine for Parquet, I find it's so much more ergonomic than Spark.
DuckDB is often compared to SQLite. But a more apt comparison might be KDB+, a proprietary embedded vector database.
Postgres is a row-store engine rather than a column store, so I believe there would need to be quite a lot of translation for Postgres to be able to process Parquet data (DuckDB and Parquet are both columnar). My hypothesis is that DuckDB would be significantly faster! However, feel free to benchmark things!
We saw speedups (20%+?), but that's orders of magnitude short of the perf we advocate DB vendors aim for when we do visual analytics integrations. Arrow opens the door to saturating networks & PCIe cards for DB<>GPU transfers, so think going for 10-50 GB/s.
So you have a new way to run SQL on Parquet et al. through DuckDB -> Arrow -> Parquet. Of course, you still need to watch out for the memory usage of your SQL query if it contains JOINs or window functions, because the integration is designed for streaming rows.