And that's a scenario which, a cynic would observe, almost certainly suits the top players on either side. It's much easier to require what would have been considered a supercomputer not long ago to run a word processor than it is to create scenarios where such power is meaningfully applied in a mass-market product.
> The Postgres Scanner uses the standard libpq library, which it statically links in. Ironically, this makes the Postgres Scanner easier to install than the other Postgres clients. However, Postgres’ normal client-server protocol is quite slow, so we spent quite some time optimizing this. As a note, DuckDB’s SQLite Scanner does not face this issue, as SQLite is also an in-process database.
> We actually implemented a prototype direct reader for Postgres' database files, but while performance was great, there is the issue that committed but not yet checkpointed data would not be stored in the heap files yet. In addition, if a checkpoint was currently running, our reader would frequently overtake the checkpointer, causing additional inconsistencies. We abandoned that approach since we want to be able to query an actively used Postgres database and believe that consistency is important. Another architectural option would have been to implement a DuckDB Foreign Data Wrapper (FDW) for Postgres, similar to duckdb_fdw, but while this could improve the protocol situation, deployment of a Postgres extension is quite risky on production servers, so we expect few people will be able to do so.
> Instead, we use the rarely-used binary transfer mode of the Postgres client-server protocol. This format is quite similar to the on-disk representation of Postgres data files and avoids some of the otherwise expensive to-string and from-string conversions. For example, to read a normal int32 from the protocol message, all we need to do is to swap byte order (ntohl).
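To make the quoted point concrete, here is a minimal sketch of that byte-order conversion in Python rather than C: the binary protocol sends integers in network (big-endian) byte order, and `struct.unpack("!i", ...)` is the portable equivalent of the `ntohl` swap the post mentions. The helper name and sample bytes are illustrative, not from the Postgres Scanner source.

```python
import struct

def read_int32(buf: bytes) -> int:
    """Decode a 32-bit signed integer from a Postgres binary-format
    value, which arrives in network (big-endian) byte order.
    The '!' prefix tells struct to use network byte order, the
    same conversion C code performs with ntohl()."""
    (value,) = struct.unpack("!i", buf[:4])
    return value

# 0x00 0x00 0x00 0x2A is 42 in network byte order
print(read_int32(b"\x00\x00\x00\x2a"))  # 42
```

On a little-endian machine this is just a byte swap; the text-mode protocol would instead ship the string `"42"` and require parsing, which is the to-string/from-string overhead the post says binary mode avoids.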
SELECT ... FROM postgresql(...) FORMAT Parquet
And you can run this query without installing ClickHouse, using the clickhouse-local command-line tool.
It can be downloaded simply as:
curl https://clickhouse.com/ | sh
If you don't use arrays and composites, Spark should be able to do it, right?
Does that help or do you have any other questions?
The datasets I work on are a bit too big for pandas, but spark is way overkill for them. DuckDB lets me efficiently work on them using only a single computer.
It really shows the difference in how a column store is so much better for certain queries.
D CALL postgres_attach('host=blabla.us-east-2.rds.amazonaws.com port=5432 dbname=my_db connect_timeout=10 password=mypass user=my_user');
Error: Invalid Error: IO Error: Unsupported Postgres type interval

If you want a shit-ton of built-in features, DuckDB is pretty damn fast even with queries that aren't optimized for a column store. Surprisingly so! I have done side-by-side tests and you really have to have a lot of rows for it to start to make a difference. And their built-in functions are awesome. And they (the company/group) are awesome and very responsive!
Having said that, DuckDB is still young enough to make me nervous about putting it in critical application stacks, and I feel about SQLite the same way I feel about PostgreSQL ... it is solid as hell and will just work.
I will say that I have not had any stability issues in any of my personal projects that use DuckDB; it is just that I know for a fact that SQLite is solid.
Also, I didn't quite understand whether DuckDB, in order to achieve this, must:
1. Read the Postgres row formatted data
2. Transform the row formatted data into its internal columnar representation
3. Keep the representation in memory
No data is persisted in DuckDB unless you do an insert statement with the result of the Postgres scan. DuckDB does process that data in a columnar fashion once it has been pulled into DuckDB memory though!
Does that help?
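The three steps above can be sketched in a few lines of Python. This is a toy illustration of the row-to-columnar pivot, not DuckDB's actual implementation: the rows stand in for what a Postgres scan would produce, and the per-column lists stand in for DuckDB's in-memory columnar vectors, which live only for the duration of the query unless you INSERT the result.

```python
# Step 1: row-formatted data, as a Postgres scan would hand it over.
rows = [
    (1, "alice", 3.5),
    (2, "bob", 2.0),
    (3, "carol", 4.25),
]

# Step 2: pivot rows into a columnar representation, one list per attribute.
columns = {
    name: [row[i] for row in rows]
    for i, name in enumerate(["id", "name", "score"])
}

# Step 3: the columnar layout lives in memory and makes per-column
# operations cheap, e.g. an aggregate over a single attribute:
total = sum(columns["score"])
print(columns["id"])  # [1, 2, 3]
print(total)          # 9.75
```

The point of the pivot is that an aggregate like the sum touches one contiguous column instead of striding across whole rows.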
What happens if the dataset you want to post-process is, let's say, 1 TB, or for that matter any size larger than the physical memory available to DuckDB?
edit: to query pg from clickhouse not clickhouse from duckdb
https://clickhouse.com/docs/en/integrations/postgresql/postg...
You can connect to Postgres using a table function:
SELECT ... FROM postgresql(...)
You can create a table with ENGINE = PostgreSQL and use it like a normal table.
You can create a database with ENGINE = PostgreSQL and it will represent all the tables from the PostgreSQL database.
And finally, you can replicate data from PostgreSQL in realtime with CDC into ClickHouse tables.
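For the first three options above, a syntax sketch (host, database, table, and credentials are placeholders; consult the ClickHouse docs for the full argument lists):

```sql
-- Ad-hoc query via the postgresql() table function
SELECT count()
FROM postgresql('pg-host:5432', 'my_db', 'my_table', 'my_user', 'mypass');

-- A persistent ClickHouse table backed by one Postgres table
CREATE TABLE pg_proxy
ENGINE = PostgreSQL('pg-host:5432', 'my_db', 'my_table', 'my_user', 'mypass');

-- A database exposing all tables of the Postgres database
CREATE DATABASE pg_mirror
ENGINE = PostgreSQL('pg-host:5432', 'my_db', 'my_user', 'mypass');
```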