Column selection:
When you have tons of columns these become useful. ClickHouse takes it to the next level and supports APPLY and COLUMNS in addition to the EXCEPT and REPLACE that DuckDB supports:
- APPLY: apply a function to a set of columns
- COLUMNS: select columns by matching a regular expression (!)
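A quick sketch of what these look like (table and column names are made up):

```sql
-- ClickHouse: apply round() to every column whose name ends in _amount
SELECT COLUMNS('_amount$') APPLY(round) FROM sales;

-- DuckDB: drop one column and rewrite another in a wide SELECT
SELECT * EXCLUDE (internal_id) REPLACE (lower(email) AS email) FROM users;
```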
Details here: https://clickhouse.com/docs/en/sql-reference/statements/sele...

Allow trailing commas:
I can't count how many times I've run into a problem with a trailing comma. There's a whole convention developed to overcome this: the prefix-comma convention, where you'd write:
SELECT
first_column
,second_column
,third_column
which lets you easily comment out a line without worrying about trailing comma errors. That's no longer necessary in DuckDB. Allowing trailing commas should be included in the SQL spec.

Yep! That would be my #1 request for SQL. Seems ridiculous that it's not supported already.
SELECT
first_column,
second_column,
third_column,

select 1 as x, x + 2 as y, y/x as z;

But I know a user requirement when I hear one, so can you give me a large, real example of where allowing this would make things easier? That would be mega helpful, ta
JSON is the other one where it annoys me, but luckily I rarely hand-write any JSON anymore (and there are semi-solutions for this like json5).
In code I always add trailing commas to anything comma-separated. It makes editing simpler (you can shuffle lines without thinking about commas). In a diff or blame it doesn't show adding a comma as a change.
SQL is the one spot where this doesn't work, and it's a constant foot-gun as I often don't remember until I run and get a syntax error.
VSCode uses it for configuration, but when I wanted to use it in Python (to add context to source-controlled Elasticsearch schemas) there were only a couple old barely-maintained libraries for parsing.
>> Allowing for trailing commas should get included in the SQL spec.
So there is no "SQL spec" per se, there's an ANSI specification with decades of convention and provider-specific customizations piled on top. This support for trailing commas is the best you're going to get.
Not just SQL: trailing commas are stupidly useful and convenient, so as far as I'm concerned every language should have them. To be fair, a decent number of languages have implemented them (I was pleasantly surprised by GCC C), but there are still notable holdouts (JSON!).
> Unique and primary key indexes are rebuilt upon startup, while user-defined indexes are discarded.
The second part with just discarding previously defined indexes is super surprising.
https://duckdb.org/docs/sql/indexes
This was an instant showstopper for me, and I assume for most people whose databases grow to the size at which an OLAP DB becomes interesting in the first place.
Also, the numerous issues on GitHub regarding crashes make me hesitant.
But I really like the core idea of DuckDB being a very simple codebase with no dependencies and still providing very good performance. I guess I just would like to see more SQLite-esque stability/robustness in the future and I'll surely revisit it at some point.
This took a little while because we use a fairly modern index structure with no literature on how to buffer-manage it.
BTW Do you have some kind of code/docs one can take a look at regarding the index structure? I'm a part-time data structure nerd :)
Typesense DB (Typesense.org) also uses ART under the hood as the base data structure for fast full-text search queries.
So I assume it would be straightforward to build a text search engine on top of DuckDB and utilize ART indices to do even more lovely things.
(disclaimer - on the DuckDB team)
I found https://duckdb.org/why_duckdb but I'm sure someone here can share some real world lessons learned?
First things first - we really like quite a lot about the SQLite approach. DuckDB is similarly easy to install and is built without dependencies, just like SQLite. It also runs in the same process as your application, just like SQLite does. SQLite is excellent as a transactional database - lots of very specific inserts, updates, and deletes (called OLTP workloads). DuckDB can also read directly out of SQLite files, so you can mix and match them! (https://github.com/duckdblabs/sqlitescanner)
DuckDB is much faster than SQLite when doing analytical queries (OLAP) like when calculating summaries or trends over time, or joining large tables together. It can use all of your CPU cores for sometimes ~100x speedup over SQLite.
DuckDB also has some enhancements with respect to data transfer in and out of it. It can natively read Pandas, R, and Julia dataframes, and can read parquet files directly also (meaning without inserting first!).
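For example, a Parquet file can be queried in place, with no load step; a minimal sketch (file name hypothetical):

```sql
-- Aggregate straight from a Parquet file, no CREATE TABLE / INSERT needed
SELECT COUNT(*) FROM read_parquet('events.parquet');
```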
Does that help? Happy to add more details!
How does DuckDB compare in that aspect? Does it have the same kind of guarantees of robustness, incorruptibility and performance (especially reading/writing binary blobs) that SQLite does?
In any case: DuckDB looks great, nice work! Good to have more players in this space!
Basically I'm wondering: if I go all in on DuckDB instead of SQLite would I notice? Do I have to keep anything in mind?
I know, probably difficult to answer without a concrete example of data, schema, queries and so on.
The SQL query features in the article seem really neat. Kudos @ shipping.
the other big thing is better native data types, especially dates. With SQLite, if you want to work with time series you need to do your own date/time casting.
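A small sketch of what native date handling buys you in DuckDB (table and dates are made up):

```sql
-- Date literals and interval arithmetic work out of the box
SELECT DATE '2022-01-01' + INTERVAL 7 DAY AS next_week;

-- Truncate timestamps to the month for a time-series rollup
SELECT date_trunc('month', ts) AS month, COUNT(*)
FROM events
GROUP BY 1;
```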
I need to do the benchmarks to substantiate this but this is my intuition.
1. SQLite has a great GUI and is really really widely supported.
2. DuckDB is properly statically typed with a much wider range of types than SQLite, which is dynamically typed and only just added support for any kind of type checking at all.
For programmers, it’s a tossup.
For most people working in data (databases, data engineering, ML etc) the column vs row thing makes a difference for datasets as small as a few hundred k records.
If you want to do something a bit more complex, you will have a bad time. Hello, `WITH RECURSIVE`.
Extremely useful. Is there a reason this wasn't in SQL in the first place? I often find myself writing very long queries just to select basically all columns except two or three of them.
Even if that turns out to be a constant source of trouble worth not having, why can't SQL at least provide column sets, so that queries could include, group, or join on these predefined sets of columns instead of repeating tens of columns and/or expressions and/or aggregations many times across a single query? You had employees.bio_set=(name, dob); now you add `edu` to it and it just works everywhere, because you think in sets rather than in specific columns. Even `group by bio_set` works. Heck, I bet most ORMs partially exist only to generate SQL, because it's sometimes unbearable as is.
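To be clear, nothing like this exists in SQL today; a purely hypothetical sketch of the column-set idea (syntax invented for illustration):

```sql
-- Hypothetical syntax, not valid in any current dialect
CREATE COLUMNSET bio_set ON employees (name, dob);

SELECT bio_set, COUNT(*)
FROM employees
GROUP BY bio_set;  -- would expand to GROUP BY name, dob
```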
It supports:
- functions,
- using an alias in the same `select` that defined it,
- trailing commas,
- date literals, f-strings and other small improvements we found unpleasant with SQL.
https://lang.prql.builders/introduction.html
The best part: it compiles into SQL. It's under development, though we will soon be releasing version 0.2, which will be a "you can check it out" version.
My fear with PRQL is that i'd more easily ask too much of the DB, given how easy it looks to write larger and more complex SQL. Thoughts?
But not always. Some analytical queries may actually need such complexity. Also, during development, you would sometimes pick only the first 50 rows before joining and grouping, with the intention of not overloading the db. To do this in SQL you need a CTE (or nested select), but in PRQL you just add a `take 50` transform to the top.
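For comparison, the SQL version of "limit early, then do the expensive part" needs that wrapping CTE; a sketch (table and column names made up):

```sql
-- Sample 50 rows first so the join and aggregation stay cheap
WITH sample AS (
    SELECT * FROM orders LIMIT 50
)
SELECT customer_id, SUM(amount) AS total
FROM sample
GROUP BY customer_id;
```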
That would be really neat, to have an official one. This article makes me want to try it even more.
Edit: I came across it via the podcast: https://www.dataengineeringpodcast.com/duckdb-in-process-ola...
Latest release notes: https://github.com/duckdb/duckdb/releases/tag/v0.3.3
[1] Error message: Not implemented Error: DataType TIMESTAMPZ not supported yet...
I fully agree that error message needs to improve, however. I will have a look at that.
DuckDB has all but replaced Pandas for my use cases. It’s much faster than Pandas even when working with Pandas data frames. I “import duckdb as db” more than I “import pandas as pd” these days.
The only thing I need now is a parallelized APPLY syntax in DuckDB.
https://fugue-tutorials.readthedocs.io/tutorials/integration...
SELECT * EXCLUDE (jar_jar_binks, midichlorians) FROM star_wars
Error: columns not found

On further investigation, it seems that someone had maliciously injected lots of bogus data into the production database. We tried to clean up by truncating tables and dropping columns, but in the end it was easier to just restore from a backup prior to 1999.
There still seems to be some residual corruption, most predominantly around the mos_eisley and jabbas_palace data, and we had to truncate the end of Return of the Jedi, but not much was lost there.
Obviously there will be a desire to monetize this project, if only to subsidize the cost of its development and maintenance. I love everything I hear and see about this project, but it makes me nervous to recommend this internally, not only because it is at such an early stage, but also because of any unforeseen costs and liabilities it might introduce in the future.
First off - DuckDB is MIT licensed, so you are welcome to use and enhance it essentially however you please!
DuckDB Labs is a commercial entity that offers commercial support and custom integrations. (https://duckdblabs.com/). If the MIT DuckDB works for what you need, then you are all set no matter what!
However, much of the IP for DuckDB is owned by a foundation, so it is independent of that commercial entity. (https://duckdb.org/foundation/)
Does that help? Happy to answer any other questions!
If you have a specific question, definitely post it here and I will clarify!
Keep the puns, I just think with a bit of adjustment the examples would be easier to understand.
PS those examples were so good! really good writing :)
(Info from memory, may be wrong or a bit mangled)
EDIT: Typo
SELECT city, COUNT(*)
FROM customers

in SQLite is transformed into:

SELECT FIRST(city), COUNT(*)
FROM customers

In our experience this is not a good default since it is almost never what you want, and hence we did not copy this behavior and instead throw an error in this situation. However, if we were to add an implicit `GROUP BY ALL`, our transformed queries would now diverge, i.e. we would transform the above query to:

SELECT city, COUNT(*)
FROM customers
GROUP BY city

Having query results diverge from SQLite on quite basic queries would confuse a lot of newcomers to DuckDB, and potentially cause silent problems when query semantics change on switching databases. We could definitely add a flag to enable this behavior, however.
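For reference, the explicit opt-in form that DuckDB does support groups by every non-aggregated column in the SELECT list:

```sql
-- Equivalent to GROUP BY city
SELECT city, COUNT(*)
FROM customers
GROUP BY ALL;
```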
A question I have for the author, or anyone using it: is there an easy way to transfer a whole Postgres DB into DuckDB so I can do some tests with actual client data? I could export each table by hand and reimport it, but that is kind of painful.
DuckDB also has the ability to read Postgres data directly, and there is a Postgres FDW that can read from DuckDB!
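A sketch of the DuckDB side, assuming the postgres_scanner extension (connection string, schema, and table names hypothetical):

```sql
-- Read a Postgres table directly from DuckDB
INSTALL postgres_scanner;
LOAD postgres_scanner;
SELECT * FROM postgres_scan('host=localhost dbname=clientdb', 'public', 'orders');
```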
It was basically syntactic sugar for a persistence API.
Instead of "select bar from foo" it used a "from foo select bar" type of syntax.
This was rather nice from a code completion perspective.
Normal insert, hard to read:
INSERT INTO table1 ( field1, field2, field3, ... ) VALUES ('value1', 'value2', 'value3', ... );
vs
Easier to read:
INSERT INTO table1 SET field1='value1', field2='value2', field3='value3', ...
SELECT age, sum(civility) as total_civility
FROM star_wars_universe
ORDER BY ALL
-- ORDER BY age, total_civility

there's no GROUP BY?

edit: (removed edit, I blew it, sorry)
This is very cool though. There are a lot of features that would make my life easier. GROUP BY ALL is noice.