clickhouse-local - https://news.ycombinator.com/item?id=22457767
q - https://news.ycombinator.com/item?id=27423276
textql - https://news.ycombinator.com/item?id=16781294
simpql - https://news.ycombinator.com/item?id=25791207
We need a benchmark, I think... ;)
dsq CLI - https://github.com/multiprocessio/dsq
DataStation desktop app - https://datastation.multiprocess.io/
Two alternative CLI tools that I looked at:
sq - https://sq.io/
octosql - https://github.com/cube2222/octosql
Honourable mentions to:
Miller - https://miller.readthedocs.io/en/latest/index.html
Dasel - https://daseldocs.tomwright.me/
These don't use SQL.
VisiData is also amazing for data navigation, although it takes some effort to grasp its model.
I’ve spent what feels like too much time looking for a Windows app that I can use to occasionally preview CSV files before moving them into WSL.
I just think $50/year is too much for occasional use.
Since this post has initiated several other shameless plugs, maybe I can swoop in and ask for shameless recommendations for something cheap in this space.
I created this project for distributing Node via PyPI: https://pypi.org/project/nodejs-bin/
2) the author once answered some questions I posted on Datafusion, so they're cool in my book
Here are my anecdotes.
I'd be willing to bet most programmers would instantly understand something like `.users[] | {email: .email, lastLogin: .logins[-1].date}`, even if they've never seen jq.
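To make that concrete, here is a minimal sketch of running that exact filter with jq (the input document and its field names are made up to match the filter):

```shell
# Hypothetical input shaped to match the filter above.
echo '{"users":[{"email":"a@example.com","logins":[{"date":"2021-05-01"},{"date":"2022-01-15"}]}]}' \
  | jq '.users[] | {email: .email, lastLogin: .logins[-1].date}'
# Prints an object with the user's email and the date of the last login entry.
```

Even without knowing jq, the pipeline reads left to right: iterate over `.users[]`, then build a new object per user, with `.logins[-1]` picking the most recent login.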
Now that I'm thinking about it, the kind of structures we often use JSON for are, in a way, a subset of what can be done with tables and pointers (foreign keys), so would it be possible to create a kind of jq to SQL compiler? Has anyone tried that?
It's not a BI tool though; if your queries vary a lot in the WHERE clause and you can't leverage indexes, you're looking at minutes of response time.
If the data is not in structures but plain csv/json, all bets are off.
I've not yet tested it at terabyte scale, although it should happily scale up there.
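The index point is easy to see with plain sqlite3 as a stand-in (the table and column names here are made up): a predicate on an indexed column becomes a SEARCH, while one on an unindexed column degrades to a full table SCAN.

```shell
sqlite3 :memory: <<'SQL'
CREATE TABLE sales(id INTEGER PRIMARY KEY, region TEXT, amount REAL);
CREATE INDEX idx_region ON sales(region);
-- Indexed column: the planner reports a SEARCH using idx_region.
EXPLAIN QUERY PLAN SELECT * FROM sales WHERE region = 'EU';
-- Unindexed column: the planner falls back to a full SCAN of sales.
EXPLAIN QUERY PLAN SELECT * FROM sales WHERE amount > 100;
SQL
```

With a highly variable WHERE clause you can't pre-build an index for every predicate, so most queries land in the SCAN case.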
If you like this, I recommend taking a look at OctoSQL[0], which I'm the author of.
It's plenty fast and easier to add new data sources for as external plugins.
It can also handle endless streams of data natively, so you can do running groupings on, for example, tailed JSON logs.
Additionally, it's able to push down predicates to the database below, so if you're selecting 10 rows from a 1 billion row table, it'll just get those 10 rows instead of getting them all and filtering in memory.
> blazingly fast
I’m going to need to see a citation for that. Last I checked, it was being beaten by Apache Spark in non-memory constrained scenarios [0]. This may be “blazingly fast” compared to Pandas or something, but it’s still leaving a TON of room on the table performance-wise. There’s a reason why Databricks found it necessary to redirect their Spark backend to a custom native query engine [1].
[0] https://andygrove.io/2019/04/datafusion-0.13.0-benchmarks/
[1] https://cs.stanford.edu/~matei/papers/2022/sigmod_photon.pdf
A simple group by
time columnq sql --table books_10m.ndjson "SELECT AVG(overall) FROM books_10m"
takes 66 seconds. The equivalent in OctoSQL takes less than 3 seconds.
I retract my statement about this project being blazingly fast, though I imagine it's just the JSON parser that requires optimization.
e.g.
sqlite3 :memory: -cmd '.mode csv' -cmd '.import royalties.csv Royalty' -cmd '.mode column' \
  'SELECT SUM(Royalty),Currency FROM Royalty GROUP BY Currency'

1. SQL can be thought of as being composed of several smaller languages: DDL, DQL, DML, DCL.
2. columnq-cli is only a CLI to a query engine, not a database. As such, it only supports DQL by design.
3. I have the impression that outside of data engineering/DBA, people are rarely taught the distinction between OLTP and OLAP workloads [1]. The latter often utilizes immutable data structures (e.g. columnar storage with column compression), or provides limited DML support, see e.g. the limitations of the DELETE statement in ClickHouse [2], or the list of supported DML statements in Amazon Athena [3]. My point -- as much as this tool is useless for transactional workloads, it is perfectly capable of some analytical workloads.
[1] Opinion, not a fact.
[2] https://clickhouse.com/docs/en/sql-reference/statements/dele...
[3] https://docs.aws.amazon.com/athena/latest/ug/functions-opera...
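The DQL-only design can be mimicked with sqlite3's read-only mode (a stand-in here, not columnq itself): SELECT goes through, while DML against the same file is rejected.

```shell
db=$(mktemp)
sqlite3 "$db" 'CREATE TABLE t(x); INSERT INTO t VALUES (1);'
# DQL works fine against a read-only handle...
sqlite3 -readonly "$db" 'SELECT x FROM t;'
# ...but DML fails with "attempt to write a readonly database".
sqlite3 -readonly "$db" 'DELETE FROM t;' || true
rm -f "$db"
```

Analytical engines draw the same line by construction: the storage underneath is treated as immutable input, not as mutable state.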
> Create full-fledged APIs for slowly moving datasets without writing a single line of code.
Even the project's name, "ROAPI", has "read only" in it.
What kind of headline would make you want to read/try such a thing?
(I'm planning on announcing it + releasing code on HN but have never done so before)