I wonder if it would be useful to integrate the semi-index code [1] into this json_fdw; thanks for sharing the code, I'll try to see how feasible this is.
[1] http://www.di.unipi.it/~ottavian/files/semi_index_cikm.pdf
For Java I recommend Sux4j [1], which has a very good implementation of Elias-Fano, but I think its balanced-parentheses support is fairly primitive. I was told by the author that a better data structure, based on range-min-max trees, should be added soon.
The reality is it is one of the most amazing open source projects I know - enough to rival Linux itself for me.
These little snippets of PostgreSQL features and facilities that keep being posted give just a tiny hint of the truly enterprise-grade capabilities it has. I just wish I still had as much time to keep on top of it as I once did, but it remains my go-to database for project development. I've even dropped it in to replace full-blown Oracle systems for serious cost savings.
The developers should be receiving far more kudos than they do.
Admittedly, it's aimed at novice users, so if you're comfortable with SQL it may feel limiting.
You can run SQL queries on individual JSON files... but you have to have PostgreSQL installed already.
And you can't run it against multiple JSON files, only one, with a single JSON object at the root.
You're right that you need to have Postgres installed. As for running SQL over multiple JSON files, we wanted to keep the blog post short, so we noted several different ways to go over multiple files in our GitHub README instead.
1. You can create a partitioned PostgreSQL table and declare one child table per JSON file. You can also declare constraints on the child tables to filter out files irrelevant to the query.
2. You can create a distributed foreign table on CitusDB, even if you'd like to run on one node. In this case, we'll collect statistics automatically for you, and filter out irrelevant files.
3. If neither of these fit, you could change the source code to scan over one directory instead of a file. We didn't go down this path to be compatible with file_fdw's semantics.
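As a sketch of option 1, here is roughly what that could look like. All object names and file paths below are invented, the server setup follows the usual foreign-data-wrapper pattern, and note that attaching foreign tables as inheritance children requires PostgreSQL 9.5 or later:

```sql
CREATE EXTENSION json_fdw;
CREATE SERVER json_server FOREIGN DATA WRAPPER json_fdw;

-- Parent table defines the shared schema; it holds no data itself.
CREATE TABLE events (
    event_id   bigint,
    created_at timestamptz,
    payload    text
);

-- One foreign child per JSON file; the CHECK constraint lets the
-- planner skip this file when the query's WHERE clause excludes it.
CREATE FOREIGN TABLE events_2013_01 (
    CHECK (created_at >= '2013-01-01' AND created_at < '2013-02-01')
) INHERITS (events)
SERVER json_server
OPTIONS (filename '/data/events/2013-01.json.gz');

-- With constraint exclusion enabled, this touches only matching files.
SELECT count(*) FROM events WHERE created_at >= '2013-01-15';
```

One child table (and one CHECK constraint) would be declared per file; the query against the parent then fans out across all of them.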
In the file system, to prevent too many records piling up in a single directory, the records were split into directories of 1000 each... base/00001000/(1000-1999).json.gz ... this was mainly so the structure could be navigated via a GUI. If your system can't handle a glob like basepath/*/*.json.gz, I would suggest you consider supporting it.
That is:
- you can take heterogeneous JSON documents and make them appear to the Postgres world as separate tables. (Which they have to be, because they're not uniform.)
- you can then take some kinds of SQL queries and distribute them among these tables, as if they were one big SQL table. If some pseudo-tables have missing or extra columns, no problem.
A good use case is exploring external APIs, many of which return JSON. This might be a faster way to figure out what you're dealing with: how large, how many, unique values, mins, maxes, etc. It could be especially useful if you need to match it against existing keys or metadata that you already have in the same database. To solve this same problem before, I've written lots of one-off scripts that convert the JSON into a CSV so that I can view it in Excel. This may or may not help me there, but it's good to see another tool.
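For instance, pointing json_fdw at a saved API response makes that exploration a few aggregate queries, and COPY can replace the one-off JSON-to-CSV scripts. The table and column names here are made up, and api_dump is assumed to be a json_fdw foreign table over the saved response:

```sql
-- How large, how many, unique values, mins, maxes.
SELECT count(*)                AS row_count,
       count(DISTINCT user_id) AS unique_users,
       min(created_at)         AS earliest,
       max(created_at)         AS latest
FROM api_dump;

-- Export a cleaned-up slice to CSV for the Excel step.
COPY (SELECT user_id, created_at FROM api_dump ORDER BY created_at)
TO '/tmp/api_dump.csv' WITH (FORMAT csv, HEADER true);
```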
It's important for these files to be kept this way, because they can be submitted to an SCM (Perforce, SVN, etc.), and this is the typical workflow. An SQL/NoSQL DB won't cut it here, since there is no revision number; even if there is, it introduces the problem of syncing what's in the DB with what's in the SCM.
Such a tool would become interesting if, after syncing the latest data from the depot (repo), you could run a tool that imports it into the DB for use by other tools. And when you are done, the data is exported back to the same files and submitted.
Some automation could be done as well.
And having such a tool would be beneficial: instead of writing yet another program to be run manually, or a resident service that watches for file changes, it could be made a PostgreSQL worker that imports changes on the fly and updates the DB at the same time.
E.g., if you use Perforce:
- Someone syncs.
- The DB automatically picks up the changes, informed by the file system, and updates itself.
The more of such things the better.
For instance, at $work we have a bunch of tables which are archives of raw click data. They are taking up quite a bit of space on disk, which we'd like to reclaim. Keeping the raw data around is good, since I want to create a summary table and you can't recreate raw data from summaries. The idea was to export the table data as CSV, and then, when I had time for messing around with summaries, I'd reload the data. With FDW I can leave the data as CSV and read it just as if it were in a proper DB table. Win!
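A sketch of that setup with file_fdw; the table, column, and file names are invented:

```sql
CREATE EXTENSION file_fdw;
CREATE SERVER csv_server FOREIGN DATA WRAPPER file_fdw;

-- The archived CSV stays on disk; this only describes how to read it.
CREATE FOREIGN TABLE clicks_archive (
    clicked_at timestamptz,
    url        text
)
SERVER csv_server
OPTIONS (filename '/archive/clicks_2012.csv', format 'csv', header 'true');

-- Build the summary directly from the file, with no reload step.
CREATE TABLE clicks_daily AS
SELECT clicked_at::date AS day, count(*) AS clicks
FROM clicks_archive
GROUP BY 1;
```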
Or even better: again at $work, we use MySQL. But I hate it, because SQL strict mode isn't enabled and MySQL does really stupid things with my data in "normal" mode. I can't safely turn on strict mode at this point because I don't have the time to test everything. I also really like Pg and would love to switch, but again, I don't have the time. What I've been thinking about is using FDW capabilities (in particular the upcoming writable FDW in 9.3) as a way to safely and slowly migrate DB platforms. It's only an idea in my head right now, but it's an intriguing one.
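One way that migration idea could look, assuming a MySQL foreign data wrapper such as mysql_fdw; all names, credentials, and options here are illustrative, a sketch rather than a tested recipe:

```sql
CREATE EXTENSION mysql_fdw;
CREATE SERVER mysql_server FOREIGN DATA WRAPPER mysql_fdw
    OPTIONS (host '127.0.0.1', port '3306');
CREATE USER MAPPING FOR CURRENT_USER SERVER mysql_server
    OPTIONS (username 'app', password 'secret');

-- Expose the legacy MySQL table inside Postgres...
CREATE FOREIGN TABLE legacy_orders (
    order_id int,
    total    numeric
)
SERVER mysql_server
OPTIONS (dbname 'shop', table_name 'orders');

-- ...then migrate it piecemeal, on your own schedule.
CREATE TABLE orders AS SELECT * FROM legacy_orders;
```

With a writable FDW, writes could also keep flowing to the legacy side during the transition, table by table.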
Here, you aren't storing the data in the database. You don't have to take any additional action to sync the file and db when you add rows to the file, so it's not loaded into the database.
The create statement is just a declaration, and doesn't load data into a postgres table. Instead, it just tells postgres how to access the file. So if I then do a query that is limited to the first few rows, it will only ever need to read the first few "rows" of the file.
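A minimal example of that declaration-only behavior; the server setup follows the usual FDW pattern and the file path and column names are hypothetical:

```sql
CREATE EXTENSION json_fdw;
CREATE SERVER json_server FOREIGN DATA WRAPPER json_fdw;

-- Declaration only: nothing is loaded or copied at CREATE time.
CREATE FOREIGN TABLE reviews (
    customer_id   text,
    review_rating int
)
SERVER json_server
OPTIONS (filename '/tmp/reviews.json');

-- The file is scanned at query time, and the LIMIT stops the scan
-- after the first few objects instead of reading the whole file.
SELECT * FROM reviews LIMIT 5;
```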
Naturally, I thought JSON was better than a relational DB for this, but I wasn't sure how best to store the data. The relational data that I do have is stashed in .csv files that I'm transitioning to SQLite, so I was wondering if there was an SQLite-esque DB for JSON; I was assuming something like CouchDB or Redis. Or am I missing something, and JSON itself IS the database?
So, I think you should probably compare json_fdw to the JSON SerDe, and CitusDB to Hive.