0 points · santiagobasulto · 3y ago · 0 comments

How do you use DuckDB in production for a company? Store SQLite file in something like S3, sync it once per day and run DuckDB with it?

chrisjc · 3y ago

While DuckDB is an exciting and amazing project, I think the world that will open up around it is just as exciting, and these are exactly the kinds of questions that get me excited.

DuckDB is to Snowflake/BigQuery/Databricks/etc. what SQLite is to MySQL/Postgres/Oracle/etc. (let's ignore for the moment that Postgres and Oracle have HTAP modes)

In other words, I don't think DuckDB aims to replace or compete against the big OLAP products/services such as Snowflake, BigQuery, and Databricks. Instead, it's a natural and complementary component in the analytical stack.

Of course you'll see in numerous blogs how amazing it is for data exploration, wrangling, Jupyter, pandas, etc., but personally I find the questions about how it could be used in production use cases a lot more fascinating.

Data warehouses can become quite expensive to run and operate when you have to either

1) allow front-end analytical applications to connect to them directly to do analytics on the fly, or

2) pre-calculate ALL the analytics (whether they're used or not) and offload them to a cheaper and "faster" OLTP system.

I'm excited about how DuckDB can sort of bridge these two solutions.

1) Prepare semi-pre-calculated data on your traditional data warehouse (stored in an internal table or an external table format like Iceberg, Delta, etc.).

2) Ingest the subsets of this data needed for different production workloads into DuckDB for last-mile analytics and slicing/dicing.

DuckDB could then get at that data by either

1) pushing queries down to internal tables via its database scanners (Arrow across the wire, postgres_scanner, hopefully more to come), or

2) pruning external tables (Iceberg, Delta, etc.) to get the subsets of semi-pre-calculated analytical data on demand, interacting with catalogs where needed. Think intelligently partitioned Parquet files on S3.

Last-mile analytics, pagination, etc. can all be done within DuckDB, either directly in your browser (WASM) or on the edge with something like AWS Lambda. This could, and hopefully will, reduce the cost of keeping data warehouses around to serve up fully pre-calculated analytics to consumers, as well as the complexity of your analytics stack/architecture.

amalter · 3y ago

Do you work on my team? This is exactly how we're using DuckDB: Databricks as the massive data bulldozer and DuckDB as the scalpel.

chrisjc · 3y ago

Definitely not since we use Snowflake, not Databricks. I'd love to hear more about your solution though!

wiredfool · 3y ago

More like -- you have a bunch of Parquet/CSV files in S3 (data lake/house/shore/party/whatever), and DuckDB can query them using SQL, from the Python bindings or via the CLI.

tylerhannan · 3y ago

It's an interesting question... DuckDB is a library. Just like SQLite is a library.

It's not designed for concurrent queries, clients connecting to a database, etc. I know there will be companies built around that problem space.

If "serverless" database is a thing, is there a category of software that is "production-less"? :)

(The above is a joke, lol.)
