Q – Execute SQL on text. Supports joins across files, RPM available (opens in new tab)

(github.com)

126 pointsharelba12y ago43 comments

43 comments

mjn12y ago

If you need only a subset of full SQL, e.g. just joins, counting/aggregation, and date manipulation (as in several of these examples), I've found it fairly easy to work with a mixture of the standard Unix join(1) [found on nearly all systems], and some of the additions from Google's crush-tools (https://code.google.com/p/crush-tools/), mainly 'aggregate', 'grepfield', 'funiq', and 'convdate'. I find chaining them together a bit easier than writing monolithic SQL statements, but there's probably some crossover point at which that wouldn't be true.

It'd be interesting to compare runtimes as well. I would guess that there's some overhead in loading into the DB up front, but that you might gain some speedup by converting longer chains of Unix pipes into one SQL query. On the other hand you might lose some parallelism. Would take some testing on different kinds of queries and data sets to get an idea of the pros/cons I think.

ozgune12y ago

Another interesting alternative could be using or writing a PostgreSQL foreign data wrapper.

There is one that supports tabular data (file_fdw), and another one for JSON files (json_fdw). If you have files in other formats, you can also write your fdw for it. This way, you get complete SQL coverage.

Also, if you don't want to pay the overhead of parsing the file every time, you can use the new materialized feature for caching: http://www.postgresql.org/docs/9.3/static/rules-materialized...

(Disclaimer: Enthused Postgres user.)

jdp12y ago

Matt Might wrote a blog post about using standard CLI tools for this purpose: http://matt.might.net/articles/sql-in-the-shell/

Rather than making a custom tool to issue SQL, the idea is that regular CLI tools map well to the traditional relational algebra operations. sed is like selection, cat is like union, etc.

unwind12y ago

It's written in Python, and seems to use SQLite under the hood.

I guess it just applies tokenization and throws the text into a temporary database.

Quite similar to the Go project https://github.com/dinedal/textql, at least superficially.

j_s12y ago

A lot of similar tools were discussed when that Go project came up 3 weeks ago:

https://news.ycombinator.com/item?id=7175830

  > MS ADO / ODBC
  > Perl DBI
  > npm j (with jqa)
  > Ruby (csv2sqlite) 
  > Python (csvkit)
  > Go (textql, comp)
  > Java (optiq, openrefine, H2 SQL)
  > R (sqldf)
  > Haskell (txt-sushi)
  > XML (xmlstarlet, xmllint, xmlstar)
  > HTML (HtmlAgilityPack, Chrome $x())
  > Postgres file_fdw
  > Oracle external tables
  > SQL Server OPENDATASOURCE and OPENQUERY
  > Log file viewers (MS LogParser, Apache asql, lnav)

harelbaOP12y ago

Hi, I'm q's creator, Harel.

There are obviously lots of other software which can provide a similar capability, and while I haven't checked all of them out, I'm really believe that most of them do a great job. However, my rationale for creating this tool was to provide a seamless addition to the Linux command line toolset - A tool as most Linux commands are, and not a capability. The distinction I'm doing here is that tools are reusable, composable and such, vs a capability which is usually less reusable in different contexts. I'm sure that some of the above are definitely tools. I just hope that the tool I have created provides value to people and helps them with their tasks.

As I posted here elsewhere, my complete rationale for creating the tool is available on the README of the github project. Comments and issues are most welcome.

Harel Ben-Attia

hipsters_unite12y ago

This seems like a cool project, but Q is already a well-used JS promise library so there's a bit of a name conflict there even if the libs are aimed at very different tasks...

mjn12y ago

And the JS promise library in turn collided with the name of Q, the array-processing language, which itself collided with the name of another programming language named Q (http://q-lang.sourceforge.net). If you're naming a tech-related thing after about 1980, the single-letter names are all taken...

hipsters_unite12y ago

Haha, well fair enough - even more reason to have checked I suppose. Maybe textQ would be more appropriate given this project's use case?

1 more reply

vog12y ago

Any library with a really short name (<= 2 letters) should be prepared for name collisions. If they wanted to avoid name collisions, they should have chosen a slightly longer, more descriptive name in the first place. I think in this context (i.e. really short name) it doesn't really matter who was first.

mbreese12y ago

It's also a single letter in the alphabet. Anyone who names their project a single letter (or two) is expecting some amount of collision. The real question is whether or not the collision is worth it.

In this case, it might be... they are trying to make a command-line tool. So in theory, you'll be typing the command often, meaning that a short name is preferable.

But honestly, it would have probably been a better idea to use a more descriptive name.

harelbaOP12y ago

Hi, i'm q's creator (HN made the name q uppercase, but it's actually a lowercase q). The reasoning was that it's used as a command line tool, and used often. So "q" and not "Q" :)I'm currently preparing the debian package, and one-letter names are not allowed, so it's going to be named "qsql" there.

chalst12y ago

Given that the command argument is in full SQL notation, which must be passed quoted, saving three or four characters on the executable name doesn't really seem worth it. The shortest example they give, a SELECT/GROUP query, is over 50 characters.

borntyping12y ago

It's also the name of a well known Python debugging tool[1]. Name conflicts will happen when you pick a name for shortness, not clarity.

[1] https://pypi.python.org/pypi/q

1ris12y ago

Or a APL family Programming language, marketed for writing queries. This could lead to some trouble.

http://en.wikipedia.org/wiki/Q_%28programming_language_from_...

ajtulloch12y ago

Dataset (https://dataset.readthedocs.org/) is a similar project that provides ad-hoc SQL querying of structured data - example usage:

  import dataset

  db = dataset.connect('sqlite:///:memory:')

  table = db['sometable']
  table.insert(dict(name='John Doe', age=37))
  table.insert(dict(name='Jane Doe', age=34, gender='female'))

  john = table.find_one(name='John Doe')

mseebach12y ago

Dataset provides ad-hoc SQL querying of ... SQL databases.

It looks like a neat API, but it's not really similar to this project.

bduerst12y ago

That's not ad hoc querying of a tabular text file.

sz4kerto12y ago

Eh, q as a name is going to be a serious problem as q is the language for programming kdb, a column-oriented database. As both are related to databases, it's difficult to defend this name.

BMarkmann12y ago

That was my initial impression, also... I though kx had added text-processing to kdb.

zeckalpha12y ago

Don't join (https://en.wikipedia.org/wiki/Join_%28Unix%29), sed, and grep get you most of the way there?

harelbaOP12y ago

Hi, I'm q's creator, Harel Ben-Attia.

The Linux toolset is really great, and I use it extensively. The whole idea of the tool is not to replace any of the existing tools, but to extend the toolset to concepts which treat text as data. In a way, it's a metatool which provides an easy and familiar way to add more data processing concepts to the linux toolset. There are many cases where I use 'wc -l' in order to count rows in a file, but if i need to count the rows of only the ones which have a specific column which is larger than the value X, or get the sum of some column per group, then q is a simple and readable way to do it properly, without any need for "tricks".

My rationale for creating it is also explained in the README of the github project.

Any more comments are most welcome.

Harel

ww212y ago

Name collision with Q/kdb+.

jmnicolas12y ago

What I really wished for when I read the title was for something that could make me write regexes but verbosely.

draegtun12y ago

You can write more verbose regex using Named Capture Buffers. Here is an example I posted on HN not long ago: https://news.ycombinator.com/item?id=6895126

NB. Follow link to original post to compare against standard regex version.

There are also some nice grammar parsers available in some languages which make this even easier. For examples of this see Perl6 Rules/Grammar, Perl5 Regexp::Grammars or (for something which doesn't used regex at all is) Rebol Parse.

For eg. Here is my Rebol version of the HN post above: http://www.reddit.com/r/programming/comments/1smpa1/why_rebo...

And here is a presentation which shows a great example using Perl6 grammars: http://jnthn.net/papers/2014-fosdem-perl6-today.pdf

Refs:

- http://en.wikibooks.org/wiki/Perl_6_Programming/Grammars

- http://en.wikipedia.org/wiki/Perl_6_rules

- https://metacpan.org/pod/Regexp::Grammars

- http://www.rebol.com/docs/core23/rebolcore-15.html

- http://blog.hostilefork.com/why-rebol-red-parse-cool/

PS. Alternatively f you looking for something interactive then checkout tools like these: http://rebol.informe.com/blog/2013/07/01/parse-aid/ | https://metacpan.org/pod/Regexp::Debugger

AUmrysh12y ago

In what way would you write regexes verbosely? I'm actually quite interested in the idea because regexes can be confusing to write at times, and it's difficult to remember which form to use where, if you use them in many languages/interfaces.

There are tools like Regexper[1] that let you visualize the regex as an automata graph, and there are tools like text2re[2] which will allow you to put in text and visually generate a regex to match it.

I feel like better regex tools should exist on the command line, and it's potentially a great place for such tools to be rapidly developed and adopted. There are GUI tools for this like poirot[3], but the command line still exists because of its accessibility, uniformity, and extensibility.

links:

[1] http://www.regexper.com/

[2] http://txt2re.com/index.php3?s=24%3AFeb%3A2014+%22This+is+an...

[3] http://www.espgraphics.com/poirot/

peterwwillis12y ago

I'm probably heavily biased, but to me Perl is the best command-line regex tool. Perl was invented to gather data and report on it, and its regex engine is incredibly fast and powerful. As an added bonus it supports some Python and PCRE-specific extensions. But this Q app is useful for people who either don't know Perl or can get what they need done faster with SQL than with scripting.

In terms of 'verbosity' you can embed comments inside a regular expression, or build a regular expression over multiple lines, or make a set of regex objects and interpolate them into larger regex's. Perl has copious amounts of documentation to help you understand the many ways to use regexs in Perl.

http://perldoc.perl.org/perlrequick.html http://perldoc.perl.org/perlretut.html http://perldoc.perl.org/perlfaq6.html#How-can-I-hope-to-use-...

1 more reply

jmnicolas12y ago

> In what way would you write regexes verbosely?

Something like SQL would be fine.

It's not a really thought out theory, but I think I'd like to manipulate text via a programming language like VI gods manipulate text with shortcuts.

Thanks for the link I will have a look at them.

1 more reply

bigd12y ago

A good compromise is to use something like Grok. https://github.com/jordansissel/grok

spacemanmatt12y ago

I'm still not seeing anything easier than ETLing into a regular old database, at which point you have the power and flexibility of a regular old database. Which is pretty spiffy, actually.

Are people really so bad at databases that they'll gladly suffer hacks like this to avoid using one?

spacemanmatt12y ago

I have ended up with MySQL and PostgreSQL on nearly every linux I've installed. It is baffling to me that someone would decline these tools in favor of text file drivers or SQLite, with the exception of using SQLite as an embedded config database or something.

super_mario12y ago

Yeah I also do the same, import the csv (it doesn't have to be comma separated) file into PostgreSQL and analyze it using SQL. Alternatively, I will load it into IPython/pandas and work with it like that.

This is however useful for one off, throwaway query that offers familiar SQL syntax, if you don't want to use awk that is.

gtrubetskoy12y ago

The credit should go to SQLite, which is extremely versatile. A while back I wrote Thredis (thredis.org), which lets you use SQL commands inside Redis (also using SQLite, of course).

pointernil12y ago

Now imagine some kind of RX.net libs made applicable to the output of the tail -F command... Reactive log file processing?

aethertap12y ago

This is really clever, and I'm definitely going to give it a workout on my system. Thanks!

chris_wot12y ago

Cool idea!

npongracic12y ago

Cool idea but i've read about something similar on Scott Hanselman's blog, it's called LogParser: http://www.hanselman.com/blog/AnalyzeYourWebServerDataAndBeE...

j / k navigate · click thread line to collapse

43 comments

mjn12y ago

ozgune12y ago

Another interesting alternative could be using or writing a PostgreSQL foreign data wrapper.

Also, if you don't want to pay the overhead of parsing the file every time, you can use the new materialized feature for caching: http://www.postgresql.org/docs/9.3/static/rules-materialized...

(Disclaimer: Enthused Postgres user.)

jdp12y ago

Matt Might wrote a blog post about using standard CLI tools for this purpose: http://matt.might.net/articles/sql-in-the-shell/

Rather than making a custom tool to issue SQL, the idea is that regular CLI tools map well to the traditional relational algebra operations. sed is like selection, cat is like union, etc.

unwind12y ago

It's written in Python, and seems to use SQLite under the hood.

I guess it just applies tokenization and throws the text into a temporary database.

Quite similar to the Go project https://github.com/dinedal/textql, at least superficially.

j_s12y ago

A lot of similar tools were discussed when that Go project came up 3 weeks ago:

https://news.ycombinator.com/item?id=7175830

  > MS ADO / ODBC
  > Perl DBI
  > npm j (with jqa)
  > Ruby (csv2sqlite) 
  > Python (csvkit)
  > Go (textql, comp)
  > Java (optiq, openrefine, H2 SQL)
  > R (sqldf)
  > Haskell (txt-sushi)
  > XML (xmlstarlet, xmllint, xmlstar)
  > HTML (HtmlAgilityPack, Chrome $x())
  > Postgres file_fdw
  > Oracle external tables
  > SQL Server OPENDATASOURCE and OPENQUERY
  > Log file viewers (MS LogParser, Apache asql, lnav)

harelbaOP12y ago

Hi, I'm q's creator, Harel.

As I posted here elsewhere, my complete rationale for creating the tool is available on the README of the github project. Comments and issues are most welcome.

Harel Ben-Attia

hipsters_unite12y ago

This seems like a cool project, but Q is already a well-used JS promise library so there's a bit of a name conflict there even if the libs are aimed at very different tasks...

mjn12y ago

hipsters_unite12y ago

Haha, well fair enough - even more reason to have checked I suppose. Maybe textQ would be more appropriate given this project's use case?

1 more reply

vog12y ago

mbreese12y ago

In this case, it might be... they are trying to make a command-line tool. So in theory, you'll be typing the command often, meaning that a short name is preferable.

But honestly, it would have probably been a better idea to use a more descriptive name.

harelbaOP12y ago

chalst12y ago

borntyping12y ago

It's also the name of a well known Python debugging tool[1]. Name conflicts will happen when you pick a name for shortness, not clarity.

[1] https://pypi.python.org/pypi/q

1ris12y ago

Or a APL family Programming language, marketed for writing queries. This could lead to some trouble.

http://en.wikipedia.org/wiki/Q_%28programming_language_from_...

ajtulloch12y ago

Dataset (https://dataset.readthedocs.org/) is a similar project that provides ad-hoc SQL querying of structured data - example usage:

  import dataset

  db = dataset.connect('sqlite:///:memory:')

  table = db['sometable']
  table.insert(dict(name='John Doe', age=37))
  table.insert(dict(name='Jane Doe', age=34, gender='female'))

  john = table.find_one(name='John Doe')

mseebach12y ago

Dataset provides ad-hoc SQL querying of ... SQL databases.

It looks like a neat API, but it's not really similar to this project.

bduerst12y ago

That's not ad hoc querying of a tabular text file.

sz4kerto12y ago

Eh, q as a name is going to be a serious problem as q is the language for programming kdb, a column-oriented database. As both are related to databases, it's difficult to defend this name.

BMarkmann12y ago

That was my initial impression, also... I though kx had added text-processing to kdb.

zeckalpha12y ago

Don't join (https://en.wikipedia.org/wiki/Join_%28Unix%29), sed, and grep get you most of the way there?

harelbaOP12y ago

Hi, I'm q's creator, Harel Ben-Attia.

My rationale for creating it is also explained in the README of the github project.

Any more comments are most welcome.

Harel

ww212y ago

Name collision with Q/kdb+.

jmnicolas12y ago

What I really wished for when I read the title was for something that could make me write regexes but verbosely.

draegtun12y ago

You can write more verbose regex using Named Capture Buffers. Here is an example I posted on HN not long ago: https://news.ycombinator.com/item?id=6895126

NB. Follow link to original post to compare against standard regex version.

For eg. Here is my Rebol version of the HN post above: http://www.reddit.com/r/programming/comments/1smpa1/why_rebo...

And here is a presentation which shows a great example using Perl6 grammars: http://jnthn.net/papers/2014-fosdem-perl6-today.pdf

Refs:

- http://en.wikibooks.org/wiki/Perl_6_Programming/Grammars

- http://en.wikipedia.org/wiki/Perl_6_rules

- https://metacpan.org/pod/Regexp::Grammars

- http://www.rebol.com/docs/core23/rebolcore-15.html

- http://blog.hostilefork.com/why-rebol-red-parse-cool/

PS. Alternatively f you looking for something interactive then checkout tools like these: http://rebol.informe.com/blog/2013/07/01/parse-aid/ | https://metacpan.org/pod/Regexp::Debugger

AUmrysh12y ago

links:

[1] http://www.regexper.com/

[2] http://txt2re.com/index.php3?s=24%3AFeb%3A2014+%22This+is+an...

[3] http://www.espgraphics.com/poirot/

peterwwillis12y ago

http://perldoc.perl.org/perlrequick.html http://perldoc.perl.org/perlretut.html http://perldoc.perl.org/perlfaq6.html#How-can-I-hope-to-use-...

1 more reply

jmnicolas12y ago

> In what way would you write regexes verbosely?

Something like SQL would be fine.

It's not a really thought out theory, but I think I'd like to manipulate text via a programming language like VI gods manipulate text with shortcuts.

Thanks for the link I will have a look at them.

1 more reply

bigd12y ago

A good compromise is to use something like Grok. https://github.com/jordansissel/grok

spacemanmatt12y ago

I'm still not seeing anything easier than ETLing into a regular old database, at which point you have the power and flexibility of a regular old database. Which is pretty spiffy, actually.

Are people really so bad at databases that they'll gladly suffer hacks like this to avoid using one?

spacemanmatt12y ago

super_mario12y ago

This is however useful for one off, throwaway query that offers familiar SQL syntax, if you don't want to use awk that is.

gtrubetskoy12y ago

The credit should go to SQLite, which is extremely versatile. A while back I wrote Thredis (thredis.org), which lets you use SQL commands inside Redis (also using SQLite, of course).

pointernil12y ago

Now imagine some kind of RX.net libs made applicable to the output of the tail -F command... Reactive log file processing?

aethertap12y ago

This is really clever, and I'm definitely going to give it a workout on my system. Thanks!

chris_wot12y ago

Cool idea!

npongracic12y ago

Cool idea but i've read about something similar on Scott Hanselman's blog, it's called LogParser: http://www.hanselman.com/blog/AnalyzeYourWebServerDataAndBeE...

j / k navigate · click thread line to collapse