I wonder if it would be useful to integrate the semi-index code [1] into this json_fdw; thanks for sharing the code, I'll try to see how feasible this is.
[1] http://www.di.unipi.it/~ottavian/files/semi_index_cikm.pdf
For Java I recommend Sux4j [1], which has a very good implementation of Elias-Fano, but I think its balanced-parentheses support is fairly primitive. I was told by the author that a better data structure, based on range-min-max trees, should be added soon.
The reality is it is one of the most amazing open source projects I know - enough to rival Linux itself for me.
These little snippets of PostgreSQL features and facilities that keep being posted give just a tiny hint of the truly enterprise-grade capabilities it has. I just wish I still had as much time to keep on top of it as I once did, but it remains my go-to database for project development. I've even dropped it in to replace full-blown Oracle systems for serious cost savings.
The developers should be receiving far more kudos than they do.
Admittedly, it's aimed at novice users, so if you're comfortable with SQL it may feel limiting.
You can run SQL queries on individual JSON files... but you have to have PostgreSQL installed already.
And you can't run it against multiple JSON files, only one, with a single JSON object at the root.
You're right that you need to have Postgres installed. As for running SQL over multiple JSON files, we wanted to keep the blog post short, so we noted several different ways to go over multiple files in our GitHub README instead.
1. You can create a partitioned PostgreSQL table and declare one child table per JSON file. You can also declare constraints on the child tables to filter out files irrelevant to the query.
2. You can create a distributed foreign table on CitusDB, even if you'd like to run on one node. In this case, we'll collect statistics automatically for you, and filter out irrelevant files.
3. If neither of these fit, you could change the source code to scan over one directory instead of a file. We didn't go down this path to be compatible with file_fdw's semantics.
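As a sketch of option 1, here is roughly what that could look like. All object names and file paths below are invented, the server setup follows the usual foreign-data-wrapper pattern, and note that attaching foreign tables as inheritance children requires PostgreSQL 9.5 or later:

```sql
CREATE EXTENSION json_fdw;
CREATE SERVER json_server FOREIGN DATA WRAPPER json_fdw;

-- Parent table defines the shared schema; it holds no data itself.
CREATE TABLE events (
    event_id   bigint,
    created_at timestamptz,
    payload    text
);

-- One foreign child per JSON file; the CHECK constraint lets the
-- planner skip this file when the query's WHERE clause excludes it.
CREATE FOREIGN TABLE events_2013_01 (
    CHECK (created_at >= '2013-01-01' AND created_at < '2013-02-01')
) INHERITS (events)
SERVER json_server
OPTIONS (filename '/data/events/2013-01.json.gz');

-- With constraint exclusion enabled, this touches only matching files.
SELECT count(*) FROM events WHERE created_at >= '2013-01-15';
```

One child table (and one CHECK constraint) would be declared per file; the query against the parent then fans out across all of them.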
In the file system, to prevent too many records piling up in a single directory, the records were split into directories of 1000 each... base/00001000/(1000-1999).json.gz ... this was mainly so the structure could be navigated via a GUI. If your system can't handle a glob like basepath/*/*.json.gz, I would suggest you consider supporting it.
That is:
- you can take heterogeneous JSON documents and make them appear to the Postgres world as separate tables. (Which they have to be, because they're not uniform.)
- you can then take some kinds of SQL queries and distribute them among these tables, as if they were one big SQL table. If some pseudo-tables have missing or extra columns, no problem.
A good use case is exploring external APIs, many of which return JSON. This might be a faster way to figure out what you're dealing with: how large, how many, unique values, mins, maxes, etc. It could be especially useful if you need to match it against existing keys or metadata that you already have in the same database. To solve this same problem before, I've written lots of one-off scripts that convert the JSON into a CSV so that I can view it in Excel. This may or may not help me there, but it's good to see another tool.
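For instance, pointing json_fdw at a saved API response makes that exploration a few aggregate queries, and COPY can replace the one-off JSON-to-CSV scripts. The table and column names here are made up, and api_dump is assumed to be a json_fdw foreign table over the saved response:

```sql
-- How large, how many, unique values, mins, maxes.
SELECT count(*)                AS row_count,
       count(DISTINCT user_id) AS unique_users,
       min(created_at)         AS earliest,
       max(created_at)         AS latest
FROM api_dump;

-- Export a cleaned-up slice to CSV for the Excel step.
COPY (SELECT user_id, created_at FROM api_dump ORDER BY created_at)
TO '/tmp/api_dump.csv' WITH (FORMAT csv, HEADER true);
```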
It's important for these files to be kept this way, because they can be submitted to an SCM (Perforce, SVN, etc.), and this is the typical workflow. An SQL/NoSQL DB won't cut it here, since there is no revision number; even if there is, it introduces the problem of syncing what's in the DB with what's in the SCM.
Such a tool would become interesting if, after syncing the latest data from the depot (repo), you could run a tool that imports it into the DB for use by other tools. And when you are done, the data is exported back to the same files and submitted.
Some automation could be done as well.
And having such a tool would be beneficial: instead of writing yet another program to be run manually, or a resident service that watches for file changes, it could be made a PostgreSQL worker that imports changes on the fly and updates the DB at the same time.
E.g., if you use Perforce:
- Someone syncs.
- The DB automatically picks up the changes, informed by the file system, and updates itself.
The more of such things the better.
For instance, at $work we have a bunch of tables which are archives of raw click data. They are taking up quite a bit of space on disk, which we'd like to reclaim. Keeping the raw data around is good, since I want to create a summary table and you can't recreate raw data from summaries. The idea was to export the table data as CSV, and then, when I had time for messing around with summaries, I'd reload the data. With FDW I can leave the data as CSV and read it just as if it were in a proper DB table. Win!
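A sketch of that setup with file_fdw; the table, column, and file names are invented:

```sql
CREATE EXTENSION file_fdw;
CREATE SERVER csv_server FOREIGN DATA WRAPPER file_fdw;

-- The archived CSV stays on disk; this only describes how to read it.
CREATE FOREIGN TABLE clicks_archive (
    clicked_at timestamptz,
    url        text
)
SERVER csv_server
OPTIONS (filename '/archive/clicks_2012.csv', format 'csv', header 'true');

-- Build the summary directly from the file, with no reload step.
CREATE TABLE clicks_daily AS
SELECT clicked_at::date AS day, count(*) AS clicks
FROM clicks_archive
GROUP BY 1;
```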
Or even better: again at $work, we use MySQL. But I hate it, because SQL strict mode isn't enabled and MySQL does really stupid things with my data in "normal" mode. I can't safely turn on strict mode at this point because I don't have the time to test everything. I also really like Pg and would love to switch, but again, I don't have the time. What I've been thinking about is using FDW capabilities (in particular the upcoming writable FDW in 9.3) as a way to safely and slowly migrate DB platforms. It's only an idea in my head right now, but it's an intriguing one.
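One way that migration idea could look, assuming a MySQL foreign data wrapper such as mysql_fdw; all names, credentials, and options here are illustrative, a sketch rather than a tested recipe:

```sql
CREATE EXTENSION mysql_fdw;
CREATE SERVER mysql_server FOREIGN DATA WRAPPER mysql_fdw
    OPTIONS (host '127.0.0.1', port '3306');
CREATE USER MAPPING FOR CURRENT_USER SERVER mysql_server
    OPTIONS (username 'app', password 'secret');

-- Expose the legacy MySQL table inside Postgres...
CREATE FOREIGN TABLE legacy_orders (
    order_id int,
    total    numeric
)
SERVER mysql_server
OPTIONS (dbname 'shop', table_name 'orders');

-- ...then migrate it piecemeal, on your own schedule.
CREATE TABLE orders AS SELECT * FROM legacy_orders;
```

With a writable FDW, writes could also keep flowing to the legacy side during the transition, table by table.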
Here, you aren't storing the data in the database. You don't have to take any additional action to sync the file and db when you add rows to the file, so it's not loaded into the database.
The create statement is just a declaration, and doesn't load data into a postgres table. Instead, it just tells postgres how to access the file. So if I then do a query that is limited to the first few rows, it will only ever need to read the first few "rows" of the file.
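A minimal example of that declaration-only behavior; the server setup follows the usual FDW pattern and the file path and column names are hypothetical:

```sql
CREATE EXTENSION json_fdw;
CREATE SERVER json_server FOREIGN DATA WRAPPER json_fdw;

-- Declaration only: nothing is loaded or copied at CREATE time.
CREATE FOREIGN TABLE reviews (
    customer_id   text,
    review_rating int
)
SERVER json_server
OPTIONS (filename '/tmp/reviews.json');

-- The file is scanned at query time, and the LIMIT stops the scan
-- after the first few objects instead of reading the whole file.
SELECT * FROM reviews LIMIT 5;
```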
Naturally, I thought JSON was better than a relational DB for this, but I wasn't sure how best to store the data. The relational data that I do have is stashed in .csv files that I'm transitioning to SQLite, so I was wondering if there was an SQLite-esque DB for JSON; I was assuming something like CouchDB or Redis. Or am I missing something, and JSON itself IS the database?
So, I think you should probably compare json_fdw to the JSON SerDe, and CitusDB to Hive.