Until recently, it was PySpark only, but we've found DuckDB gives us great performance on medium-sized data. This will be enabled in a forthcoming release (we have an early pre-release demo of the DuckDB backend [2]). This new DuckDB backend will probably be fast enough for the majority of our users, who don't have massive datasets.
With this in mind, I'm excited to hear that: > Another large area of future work is to make our aggregate hash table work with out-of-core operations, where an individual hash table no longer fits in memory, this is particularly problematic when merging.
This would be an amazing addition. Our users typically need to process sensitive data, and spinning up Spark can be a challenge from an infrastructure perspective. I imagine that, going forward, more and more will be possible on a single beefy machine that is easily spun up in the cloud.
Anyway, really just wanted to say thanks to the DuckDB team for great work - you're enabling a lot of value downstream!
[1] https://github.com/moj-analytical-services/splink [2] https://github.com/moj-analytical-services/splink_demos/tree...
The DuckDB wrapper I sent you in the GitHub issue a few weeks ago linked a pair of five-million-record datasets in about twenty minutes. Spark took about three hours to do the same job on a cluster with effectively unlimited resources.
At Tenzir, we looked at DuckDB as an embeddable backend engine to do the heavy lifting of query execution for our engine [1]. Our idea is to throw over a set of Parquet files along with a query: initially SQL, but perhaps soon Substrait [2] if it picks up.
We also experiment with a cloud deployment [3], where a different set of I/O paths may warrant a different backend engine. Right now, we're working on a serverless approach leveraging DataFusion (and, depending on maturity, Ballista at some point).
My hunch is that we will see more pluggability in this space moving forward. It's not only meaningful from an open-core business model perspective, but also pays dividends for the UX. The company that's solving a domain problem (for us: security operations center infrastructure) can leverage a high-bandwidth drop-in engine and only needs to wire it up properly. This requires far fewer data engineers than building a poor man's version of the same thing in-house.
We also have the R use case, e.g., writing reports in R Markdown that crunch customer security telemetry, highlighting outliers or other noteworthy events. We're not there yet, but with the right query backend I would expect to get this almost for free. We're close to being ready to use Arrow Flight for interop, but it's not zero-copy. DuckDB demonstrated the zero-copy approach recently [4], going through the C API. (The story is also relevant when doing s/R/Python/, FWIW.)
[1] https://github.com/tenzir/vast [2] https://github.com/substrait-io/substrait [3] https://github.com/tenzir/vast/tree/master/cloud/aws [4] https://duckdb.org/2021/12/03/duck-arrow.html
The tidyverse (R) is superior for data exploration, but R is not fun to deploy, and building complex multi-job data pipelines in it is painful.
Are folks who say this not using containers? R has been at least as easy to dockerize as Python since whenever Rocker started, and it has only gotten easier with more recent package-management options. Once dockerized, my only R complaints are around logging inconsistencies.
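For what it's worth, the Rocker-plus-renv pattern can be sketched in a few lines (the base image tag and file names here are illustrative, not prescriptive):

```dockerfile
# Pinned base image from the Rocker project
FROM rocker/r-ver:4.3.1

WORKDIR /app

# renv.lock pins exact package versions; renv::restore() reinstalls them
COPY renv.lock renv.lock
RUN R -e "install.packages('renv'); renv::restore(lockfile = 'renv.lock')"

COPY pipeline.R pipeline.R
CMD ["Rscript", "pipeline.R"]
```

Pinning the base image tag and the lockfile together is what makes the build reproducible.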
I used to think the culture around R meant that productionizing arbitrary code was harder on average than in Python... but years of suffering with the pandas API have me thinking the opposite these days.
I can trust a junior R dev to write reusable pure functions but can't trust a senior Python dev to do the same!
This means you don't just have to pin your R package versions; you also have to pin all the build dependencies.
And you need a different image for each set of R packages, because they might have different build dependencies.