Until recently, it was PySpark only, but we've found DuckDB gives us great performance on medium-sized data. This will be enabled in a forthcoming release (we have an early pre-release demo of the DuckDB backend [2]). This new DuckDB backend will probably be fast enough for the majority of our users, who don't have massive datasets.
With this in mind, I'm excited to hear that: > Another large area of future work is to make our aggregate hash table work with out-of-core operations, where an individual hash table no longer fits in memory, this is particularly problematic when merging.
This would be an amazing addition. Our users typically need to process sensitive data, and spinning up Spark can be a challenge from an infrastructure perspective. I imagine that, going forward, more and more will be possible on a single beefy machine that is easily spun up in the cloud.
Anyway, really just wanted to say thanks to the DuckDB team for great work - you're enabling a lot of value downstream!
[1] https://github.com/moj-analytical-services/splink [2] https://github.com/moj-analytical-services/splink_demos/tree...
The DuckDB wrapper I sent you in the GitHub issue a few weeks ago linked a pair of five-million-record datasets in about twenty minutes. Spark took about three hours to do the same job on a cluster with effectively unlimited resources.
At Tenzir, we looked at DuckDB as an embeddable backend engine to do the heavy lifting of query execution for our engine [1]. Our idea is to throw over a set of Parquet files along with a query: initially SQL, but perhaps soon Substrait [2] if it picks up.
We also experiment with a cloud deployment [3], where a different set of I/O paths may warrant a different backend engine. Right now, we're working on a serverless approach leveraging DataFusion (and, depending on maturity, Ballista at some point).
My hunch is that we will see more pluggability in this space moving forward. It's not only meaningful from an open-core business model perspective, but also pays dividends for the UX. The company that's solving a domain problem (for us: security operations center infrastructure) can leverage a high-bandwidth drop-in engine and only needs to wire it up properly. This requires far fewer data engineers than building a poor man's version of the same thing in-house.
We also have the R use case, e.g., writing reports in R Markdown that crunch customer security telemetry, highlighting outliers or other noteworthy events. We're not there yet, but with the right query backend I would expect to get this almost for free. We're close to being ready to use Arrow Flight for interop, but it's not zero-copy. DuckDB demonstrated the zero-copy approach recently [4], going through the C API. (The story is also relevant when doing s/R/Python/, FWIW.)
[1] https://github.com/tenzir/vast [2] https://github.com/substrait-io/substrait [3] https://github.com/tenzir/vast/tree/master/cloud/aws [4] https://duckdb.org/2021/12/03/duck-arrow.html
The tidyverse (R) is superior for data exploration, but R is not fun to deploy, and building complex multi-job data pipelines in it is painful.
Are folks who say this not using containers? R has been at least as easy to dockerize as Python since whenever Rocker started, and it has only gotten easier with more recent package-management options. Once dockerized, my only R complaints are around logging inconsistencies.
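For what it's worth, the Rocker-plus-renv pattern can be sketched in a few lines (the base image tag and file names here are illustrative, not prescriptive):

```dockerfile
# Pinned base image from the Rocker project
FROM rocker/r-ver:4.3.1

WORKDIR /app

# renv.lock pins exact package versions; renv::restore() reinstalls them
COPY renv.lock renv.lock
RUN R -e "install.packages('renv'); renv::restore(lockfile = 'renv.lock')"

COPY pipeline.R pipeline.R
CMD ["Rscript", "pipeline.R"]
```

Pinning the base image tag and the lockfile together is what makes the build reproducible.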
I used to think the culture around R meant that productionizing arbitrary code was harder on average than in Python... but years of suffering with the pandas API have me thinking the opposite these days.
I can trust a junior R dev to write reusable pure functions but can't trust a senior Python dev to do the same!
This means you don't just have to pin your R package versions; you also have to pin all the build dependencies.
And you need a different image for each set of R packages, because they might have different build dependencies.