Column selection:
When you have tons of columns these become useful. ClickHouse takes it to the next level and supports APPLY and COLUMNS in addition to the EXCEPT and REPLACE that DuckDB supports:
- APPLY: apply a function to a set of columns
- COLUMNS: select columns by matching a regular expression (!)
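A quick sketch of what these look like (table and column names are made up):

```sql
-- ClickHouse: apply round() to every column whose name ends in _amount
SELECT COLUMNS('_amount$') APPLY(round) FROM sales;

-- DuckDB: drop one column and rewrite another in a wide SELECT
SELECT * EXCLUDE (internal_id) REPLACE (lower(email) AS email) FROM users;
```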
Details here: https://clickhouse.com/docs/en/sql-reference/statements/sele...

Allow trailing commas:
I can't count how many times I've run into a problem with a trailing comma. There's a whole convention developed to overcome this: the prefix-comma convention, where you'd write:
SELECT
first_column
,second_column
,third_column
which lets you easily comment out a line without worrying about trailing comma errors. That's no longer necessary in DuckDB. Allowing trailing commas should be included in the SQL spec.

Yep! That would be my #1 request for SQL. Seems ridiculous that it's not supported already.
SELECT
first_column,
second_column,
third_column,

select 1 as x, x + 2 as y, y/x as z;

But I know a user requirement when I hear one, so can you give me a large, real example of where allowing this would make things easier? That would be mega helpful, ta
JSON is the other one where it annoys me, but luckily I rarely hand-write any JSON anymore (and there are semi-solutions for this like json5).
In code I always add trailing commas to anything comma-separated. It makes editing simpler (you can shuffle lines without thinking about commas). In a diff or blame it doesn't show adding a comma as a change.
SQL is the one spot where this doesn't work, and it's a constant foot-gun as I often don't remember until I run and get a syntax error.
VSCode uses it for configuration, but when I wanted to use it in Python (to add context to source-controlled Elasticsearch schemas) there were only a couple old barely-maintained libraries for parsing.
>> Allowing for trailing commas should get included in the SQL spec.
So there is no "SQL spec" per se, there's an ANSI specification with decades of convention and provider-specific customizations piled on top. This support for trailing commas is the best you're going to get.
Not just SQL: trailing commas are stupidly useful and convenient, so as far as I'm concerned every language should have them. To be fair, a decent number of languages have implemented them (I was pleasantly surprised by GCC C), but there are still notable holdouts (JSON!).
> Unique and primary key indexes are rebuilt upon startup, while user-defined indexes are discarded.
The second part with just discarding previously defined indexes is super surprising.
https://duckdb.org/docs/sql/indexes
This was an instant showstopper for me, and I assume for most people whose databases grow to the size at which an OLAP DB becomes interesting in the first place.
Also, the numerous issues on GitHub regarding crashes make me hesitant.
But I really like the core idea of DuckDB being a very simple codebase with no dependencies and still providing very good performance. I guess I just would like to see more SQLite-esque stability/robustness in the future and I'll surely revisit it at some point.
This took a little while because we use a fairly modern index structure with no literature on how to buffer-manage it.
BTW Do you have some kind of code/docs one can take a look at regarding the index structure? I'm a part-time data structure nerd :)
Typesense DB (Typesense.org) also uses ART under the hood as the base data structure for fast full-text search queries.
So I assume it would be straightforward to build a text search engine on top of DuckDB and utilize ART indices to do even more lovely things.
(disclaimer - on the DuckDB team)
I found https://duckdb.org/why_duckdb but I'm sure someone here can share some real world lessons learned?
First things first - we really like quite a lot about the SQLite approach. DuckDB is similarly easy to install and is built without dependencies, just like SQLite. It also runs in the same process as your application, just like SQLite does. SQLite is excellent as a transactional database - lots of very specific inserts, updates, and deletes (called OLTP workloads). DuckDB can also read directly out of SQLite files, so you can mix and match them! (https://github.com/duckdblabs/sqlitescanner)
DuckDB is much faster than SQLite when doing analytical queries (OLAP) like when calculating summaries or trends over time, or joining large tables together. It can use all of your CPU cores for sometimes ~100x speedup over SQLite.
DuckDB also has some enhancements with respect to data transfer in and out of it. It can natively read Pandas, R, and Julia dataframes, and can read parquet files directly also (meaning without inserting first!).
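For example, a Parquet file can be queried in place, with no load step; a minimal sketch (file name hypothetical):

```sql
-- Aggregate straight from a Parquet file, no CREATE TABLE / INSERT needed
SELECT COUNT(*) FROM read_parquet('events.parquet');
```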
Does that help? Happy to add more details!
How does DuckDB compare in that aspect? Does it have the same kind of guarantees of robustness, incorruptibility and performance (especially reading/writing binary blobs) that SQLite does?
In any case: DuckDB looks great, nice work! Good to have more players in this space!
Basically I'm wondering: if I go all in on DuckDB instead of SQLite would I notice? Do I have to keep anything in mind?
I know, probably difficult to answer without a concrete example of data, schema, queries and so on.
The SQL query features in the article seem really neat. Kudos @ shipping.
the other big thing is better native data types, especially dates. With SQLite, if you want to work with time series you need to do your own date/time casting.
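A small sketch of what native date handling buys you in DuckDB (table and dates are made up):

```sql
-- Date literals and interval arithmetic work out of the box
SELECT DATE '2022-01-01' + INTERVAL 7 DAY AS next_week;

-- Truncate timestamps to the month for a time-series rollup
SELECT date_trunc('month', ts) AS month, COUNT(*)
FROM events
GROUP BY 1;
```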
I need to do the benchmarks to substantiate this but this is my intuition.
1. SQLite has a great GUI and is really really widely supported.
2. DuckDB is properly statically typed with a much wider range of types than SQLite, which is dynamically typed and only just added support for any kind of type checking at all.
For programmers, it’s a tossup.
For most people working in data (databases, data engineering, ML etc) the column vs row thing makes a difference for datasets as small as a few hundred k records.
If you want to do something a bit more complex, you will have a bad time. Hello, `WITH RECURSIVE`.
Extremely useful. Is there a reason this wasn't in SQL in the first place? I often find myself writing very long queries just to select basically all columns except two or three of them.
Even if that turns out to be a constant source of trouble worth not having, why can't SQL at least provide column sets, so that queries could include, group, or join on these predefined sets of columns instead of repeating tens of columns and/or expressions and/or aggregations many times across a single query? You had employees.bio_set=(name, dob); now you add `edu` to it and it just works everywhere, because you think in sets rather than in specific columns. Even `group by bio_set` works. Heck, I bet most ORMs partially exist only to generate SQL, because it's sometimes unbearable as is.
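To be clear, nothing like this exists in SQL today; a purely hypothetical sketch of the column-set idea (syntax invented for illustration):

```sql
-- Hypothetical syntax, not valid in any current dialect
CREATE COLUMNSET bio_set ON employees (name, dob);

SELECT bio_set, COUNT(*)
FROM employees
GROUP BY bio_set;  -- would expand to GROUP BY name, dob
```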
It supports:
- functions,
- using an alias in the same `select` that defined it,
- trailing commas,
- date literals, f-strings and other small improvements we found unpleasant with SQL.
https://lang.prql.builders/introduction.html
The best part: it compiles into SQL. It's under development, though we will soon be releasing version 0.2, which will be a "you can check it out" version.
My fear with PRQL is that i'd more easily ask too much of the DB, given how easy it looks to write larger and more complex SQL. Thoughts?
But not always. Some analytical queries may actually need such complexity. Also, during development, you would sometimes pick only the first 50 rows before joining and grouping, with the intention of not overloading the db. To do this in SQL you need a CTE (or nested select), but in PRQL you just add a `take 50` transform to the top.
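For comparison, the SQL version of "limit early, then do the expensive part" needs that wrapping CTE; a sketch (table and column names made up):

```sql
-- Sample 50 rows first so the join and aggregation stay cheap
WITH sample AS (
    SELECT * FROM orders LIMIT 50
)
SELECT customer_id, SUM(amount) AS total
FROM sample
GROUP BY customer_id;
```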
That would be really neat, to have an official one. This article makes me want to try it even more.
Edit: I came across it via the podcast: https://www.dataengineeringpodcast.com/duckdb-in-process-ola...
Latest release notes: https://github.com/duckdb/duckdb/releases/tag/v0.3.3
[1] Error message: Not implemented Error: DataType TIMESTAMPZ not supported yet...
I fully agree that error message needs to improve, however. I will have a look at that.
DuckDB has all but replaced Pandas for my use cases. It’s much faster than Pandas even when working with Pandas data frames. I “import duckdb as db” more than I “import pandas as pd” these days.
The only thing I need now is a parallelized APPLY syntax in DuckDB.
https://fugue-tutorials.readthedocs.io/tutorials/integration...
SELECT * EXCLUDE (jar_jar_binks, midichlorians) FROM star_wars
Error: columns not found

On further investigation, it seems that someone had maliciously injected lots of bogus data into the production database. We tried to clean up by truncating tables and dropping columns, but in the end it was easier to just restore from a backup prior to 1999.
There still seems to be some residual corruption, most predominantly around the mos_eisley and jabbas_palace data, and we had to truncate the end of Return of the Jedi, but not much was lost there.
Obviously there will be a desire to monetize this project, if only to subsidize the cost of its development and maintenance. I love everything I hear and see about this project, but it makes me nervous to recommend this internally, not only because it is at such an early stage, but also because of any unforeseen costs and liabilities it might introduce in the future.
First off - DuckDB is MIT licensed, so you are welcome to use and enhance it essentially however you please!
DuckDB Labs is a commercial entity that offers commercial support and custom integrations. (https://duckdblabs.com/). If the MIT DuckDB works for what you need, then you are all set no matter what!
However, much of the IP for DuckDB is owned by a foundation, so it is independent of that commercial entity. (https://duckdb.org/foundation/)
Does that help? Happy to answer any other questions!
If you have a specific question, definitely post it here and I will clarify!
Keep the puns, I just think with a bit of adjustment the examples would be easier to understand.
PS those examples were so good! really good writing :)
(Info from memory, may be wrong or a bit mangled)
EDIT: Typo
SELECT city, COUNT(*)
FROM customers

in SQLite is transformed into:

SELECT FIRST(city), COUNT(*)
FROM customers

In our experience this is not a good default since it is almost never what you want, and hence we did not copy this behavior and instead throw an error in this situation. However, if we were to add an implicit `GROUP BY ALL`, our transformed queries would now diverge, i.e. we would transform the above query to:

SELECT city, COUNT(*)
FROM customers
GROUP BY city

Having query results diverge from SQLite on quite basic queries would confuse a lot of newcomers to DuckDB, and potentially cause silent problems when query semantics change on switching databases. We could definitely add a flag to enable this behavior, however.
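For reference, the explicit opt-in form that DuckDB does support groups by every non-aggregated column in the SELECT list:

```sql
-- Equivalent to GROUP BY city
SELECT city, COUNT(*)
FROM customers
GROUP BY ALL;
```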
A question I have for the author, or anyone using it: is there an easy way to transfer a whole Postgres DB into DuckDB so I can do some tests with actual client data? I could export each table by hand and reimport it, but that is kind of painful.
DuckDB also has the ability to read Postgres data directly, and there is a Postgres FDW that can read from DuckDB!
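A sketch of the DuckDB side, assuming the postgres_scanner extension (connection string, schema, and table names hypothetical):

```sql
-- Read a Postgres table directly from DuckDB
INSTALL postgres_scanner;
LOAD postgres_scanner;
SELECT * FROM postgres_scan('host=localhost dbname=clientdb', 'public', 'orders');
```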
It was basically syntactic sugar for a persistence API.
Instead of "select bar from foo" it used a "from foo select bar" type of syntax.
This was rather nice from a code completion perspective.
Normal insert, hard to read:
INSERT INTO table1 ( field1, field2, field3, ... ) VALUES ('value1', 'value2', 'value3', ... );
vs
Easier to read:
INSERT INTO table1 SET field1='value1', field2='value2', field3='value3', ...
SELECT age, sum(civility) as total_civility
FROM star_wars_universe
ORDER BY ALL
-- ORDER BY age, total_civility

there's no GROUP BY?

edit: (removed edit, I blew it, sorry)
This is very cool though. There are a lot of features that would make my life easier. GROUP BY ALL is noice.