We're in the process of switching to a more balanced schema (mentioned in this post) and the results have been pretty good so far.
Another win has been that the better stats make it possible to reliably get bitmap joins from the planner. Our configuration uses ~12 RAIDed EBS drives, so the I/O concurrency is really high and prefetching for a bitmap scan works particularly well.
1) Using jsonb to store user-generated forms and submissions to those forms. As an example, you can create a form with text inputs, checkboxes, etc., and others can submit responses with that form. I find that these forms and their submissions are best stored as jsonb because their contents are largely opaque (I don't care about the contents except where they are rendered on the client), their structure is highly dynamic, and their schema changes frequently.
2) As a specialized case of #1, applying filters to user-generated form submissions. jsonb supports containment operators (@> and <@, if I remember correctly), which make easy work of dynamically filtering form submissions on custom form fields, even for complex filter conditions.
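A containment filter of that shape might look like this (the table and field names here are hypothetical; `data` is the jsonb column holding a submission's answers):

```sql
-- Find submissions whose answers contain these key/value pairs.
-- @> asks: does the jsonb on the left contain the jsonb on the right?
SELECT id
FROM submissions
WHERE data @> '{"newsletter": true, "country": "DE"}';
```

Because @> checks containment rather than a single extracted field, the same query shape works no matter which custom fields a given form defines.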
3) Storing/munging/slicing relatively low-volume log data is fantastic with jsonb. This is always for admin/diagnostic reasons, so it's not as performance-critical, and the ability to group on and do subset operations on jsonb fields makes slicing your data really easy.
Using ->> (or ->) in a WHERE clause is generally a bad idea, and certainly a terrible idea without an explicit index. Use @> instead.
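For example (a hypothetical `events` table with a jsonb `properties` column), a GIN index supports @> directly, while the ->> form cannot use it:

```sql
-- A GIN index over the whole jsonb document supports containment (@>).
CREATE INDEX events_properties_idx ON events USING gin (properties);

-- Can use the GIN index:
SELECT * FROM events WHERE properties @> '{"status": "open"}';

-- Cannot use it; without a dedicated expression index this scans the table:
SELECT * FROM events WHERE properties ->> 'status' = 'open';
```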
I read a presentation titled "Powering Heap" by Dan Robinson, Lead Engineer at Heap, which contains interesting info about how you use PostgreSQL. [1]
At Heap, do you try to keep rows belonging to the same customer_id contiguous on disk, in order to minimize disk seeks?
If yes, how do you do it? Do you use something like pg_repack?
If no, don't you suffer from reading heap pages that contain only one or a few rows belonging to the requested customer_id?
[1] http://info.citusdata.com/rs/235-CNE-301/images/Powering_Hea...
Currently, maintaining the clustering has only been best-effort. We sort our data whenever we copy it from one location to another and the data comes in sorted by time, so it's fairly easy to maintain a high row correlation with time.
It doesn't mean there is no data model (schema), and it doesn't mean that the data model is flexible. It actually means you have a succession of distinct and undocumented schemas, updated on a haphazard, ad hoc basis, with no documentation or thought given to the change.
Every version of every app and every support program ever written then has to know about each and every historic variant of the data model that ever existed. This is a maintenance nightmare when your app is more than a few iterations old, and when you have several decoupled support systems trying to use the database.
With an overt schema, you are required to at least think about what you're doing and to do it in a centralized fashion, rather than slip changes in haphazardly in any app that ever touches the database, and you're required to ensure that the data already in the database actually conforms to the new schema. You won't have one app that puts the work email in "work-email" with a dash, and another that tries to use "work_email" with an underscore, for example.
I've seen otherwise...
Having learned that postgresql NULLs are cheap in storage (since they just use a bitmap) I stopped caring much about empty column proliferation.
I've also found that arbitrary user-defined keys are still better treated via a "known keys" table, the long-term visibility into data is invaluable, there's more join flexibility, and there's opportunity to assign metadata (e.g. type hints) for each user-defined key. At scale, these things make json columns look like technical debt.
The only use case I currently have for jsonb is when the absence of an element is meaningfully different from a null value, e.g. options structures that wouldn't otherwise benefit from being in normalized tables.
Is it a table with columns like user_defined_text_1, user_defined_text_2, ..., user_defined_text_10, user_defined_integer_1, user_defined_integer_2, ... user_defined_integer_n ?
Honest question - in what settings would many optional values be impractical or impossible? Is it purely space/performance constraints? If so, it doesn't sound like JSONB gives you wins in either of those cases.
1. You could have a table for every variant of the form
2. You could have a table with every field that's ever been used as a nullable field and add new columns every time a new version of the form appears
3. You could have a table with columns for the important fields and a JSONB column for the variable data
Of the three options, 3 seems the most elegant to me. The other two are basically allowing the ugliness of the source data to spew complexity all over your database.
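A sketch of option 3 (all names hypothetical): the stable, frequently queried fields are real columns, and everything form-specific goes into jsonb:

```sql
CREATE TABLE form_submissions (
    id         bigserial   PRIMARY KEY,
    form_id    bigint      NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now(),
    -- fields every form shares stay typed, constrained, and indexable
    email      text        NOT NULL,
    -- per-form variable data lands here
    fields     jsonb       NOT NULL DEFAULT '{}'
);
```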
Depending on your RDBMS this is not actually so bad. SQL Server has sparse column support which helps to make this sort of schema practical. It comes with some strings attached, however [1].
The latter is a sort of "inheritance" done in SQL, and I have used it with good results in the past.
One such example would be a transactions table, where the transaction info I care about lives in dedicated columns, but additional details from, say, PayPal or another payment processor are stored as JSON. There is little value in parsing that detailed information into a unified format, which would always carry "extra" processor-specific fields, and it avoids the need for a dedicated table per payment processor.
Another example would be logging, where some information is consistent but other information is mainly just stored. As an aside from a db server: I also like jsonl (line-terminated JSON, one record per line) in a gzip stream; it works very well.
There are other considerations as well; in some cases a document store such as Mongo or RethinkDB may be closer to what you ideally want, or for that matter you may need something closer to Cassandra.
Traditional SQL databases have column names that are stored separately from the table data, but in JSON, the keys are stored in the data itself.
I've worked with MongoDB systems that held 100 terabytes of data. At that scale, we had to re-write all of the keys so they were only a single character. When JSON is small, it is pleasant to have keys that make sense, such as:
username : mary
but as JSON gets big, this needs to change to:
u : mary
If you have 10 million records, the difference (in this trivial example) is 70 million characters, and if you have 100 different collections with an average of 20 keys each, then you are talking about 140 billion characters saved.
Most of the companies that I know, if they work with JSON document stores at scale (such as MongoDB) eventually take steps to make the keys smaller.
This may make sense if you have custom fields, i.e. the set of keys is user-defined and ever increasing.
Apart from that, there is no reason to do this. If it is all about avoiding having "too many" columns and/or "too many" null values, then I'd say: Don't worry. Just use as many columns as you need.
I would even go one step further and say: It is a common anti-pattern to introduce generic key-value stores (through a separate table, JSON structures or XML structures) without a compelling reason.
With JSONB I can receive a boolean element with five states: null, true, false, invalid (i.e. actually a string, number or array value), or simply nonexistent, and it's up to the application to deal with all such cases. You may have to treat the database as a source of potentially invalid data that must be sanitized.
With regular DB columns you can reasonably assume that a boolean not null column will a) exist for all records and b) return either true or false. You need only scan the database schema at application start to verify this.
On the other hand, if you're using JSONB for an options structure then the possibility of key nonexistence may actually be useful, since it implies "use system default". (Yes this can differ from the meaning of a NULL value, especially if you are merging options structures).
You might question whether it is wise to have a "boolean" key/value pair with four semantically distinct states. I can only say I have done this and I am not proud of it.
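Those states can at least be told apart in SQL (a hypothetical `settings` table with a jsonb `opts` column):

```sql
SELECT
    opts ? 'flag'                AS key_exists,  -- false: key nonexistent
    opts -> 'flag'               AS raw_value,   -- SQL NULL when key absent
    jsonb_typeof(opts -> 'flag') AS value_type   -- 'boolean', 'null', 'string', ...
FROM settings;
```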
This is an anti-pattern because you can't model "1:1" that way. Instead, it will be "1:0..1", and now you have some nasty corner cases when the first table has an entry whose counterpart in the second table is missing. Also, when using a column you now always have to think about which table it lives in - the first or the second?
I had to work with such a design in a real-world project and it was really annoying.
Taking the well-defined subset of searchable properties and making them columns, as described in the article, is really the best solution.
Actually, we store only 6 things inside the table and the rest inside the jsonb.
However, we still miss about 8 values that we use in a list view, which is slow, but a materialized view came to the rescue. We may pull them out into columns at some point; we still need to figure that out, since the jsonb data set is also the input to a hash that checks the validity of the data, which we use for change detection against other systems.
It is much better for data that is stored once and then queried.
Can't this be solved by introducing an expression index[1] for "record ->> 'value_2'"?
This would add a specialized index that will be used for all queries that have a filter like "WHERE record ->> 'value_2' = '0' AND ..." (note that ->> returns text, so you compare against a string or index a cast).
[1] https://www.postgresql.org/docs/current/static/indexes-expre...
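A sketch using the column names from the article (the table name `records` is assumed); indexing the cast keeps integer comparisons index-friendly:

```sql
-- Expression indexes need an extra set of parentheses around the expression.
CREATE INDEX records_value_2_idx ON records (((record ->> 'value_2')::int));

-- This predicate matches the indexed expression and can use the index:
SELECT * FROM records WHERE (record ->> 'value_2')::int = 0;
```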
In fact, this might make the query slower. If postgres thinks it is selecting a very small number of rows, it will prefer an index scan of some kind, but a full table scan will be faster if it's retrieving 1/8th of the table (at least, for small rows like these). So, you might get a slower row retrieval and the same explosively slow join.
In the worked example, adding expression indices for the integer values of value_1, value_2, and value_3 makes the JSONB solution only marginally less efficient than the full-column solution. On my computer, ~300ms instead of ~200ms.
(This is Postgres 9.5)
The storage is almost 2x; accessing attributes is more expensive than with tabular data even when the JSONB is indexed; the data can be dirty (the client can send extra attributes or invalid values for existing attributes, and since JSONB doesn't have any schema, PostgreSQL doesn't validate or cast the values); and, as the author mentioned, PostgreSQL statistics and indexes don't play nicely with JSONB.
Seems like a basic feature to improve space savings and processing speed.
Am I missing something about why this isn't done already, and automatically?
IMHO anything that is to be queried against regularly should be normalized into an actual column.
But, we had a situation where we had a "user_meta" equivalent, but wanted to support different data types (and even possibly nested data) using JSONB allowed for simple modelling of something like `{ "age": 1, "school": "blah", "something": { "in": "depth" } }` which isn't as simple using an extra "meta" table.
(Not to say it's the best thing to do (depending on the situation it might be better to have explicit columns for those fields) but it's an example of how it can be more powerful than just having an extra "meta" table.)
But as an alternative to things like serialized objects, I think it's definitely a huge win. You can do things like join a jsonb object property to its parent table, which wouldn't be possible with serialized objects.
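For instance (a hypothetical `orders` table whose `payload` jsonb holds an `items` array), each array element can be expanded and joined back to its parent row:

```sql
SELECT o.id, item ->> 'sku' AS sku
FROM orders o
CROSS JOIN LATERAL jsonb_array_elements(o.payload -> 'items') AS item;
```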
If you're developing an application in-house, this is not a big deal - you can make sure you have the right PostgreSQL version. If you're hosting the application on a shared database server, well, you're exactly in the same situation as with other software products.
This is mostly fixed in 9.5: http://blog.2ndquadrant.com/jsonb-and-postgresql-9-5-with-ev...
Protip: Use the planner config settings[1] (one of which is mentioned in this article) with SET LOCAL in a transaction if you're really sure the query planner is giving you guff. On more structured data that Postgres can calculate statistics on, let it do its magic.
[1]: https://www.postgresql.org/docs/current/static/runtime-confi...
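A minimal sketch of the pattern (the table name and setting are chosen for illustration); SET LOCAL reverts automatically at COMMIT or ROLLBACK, so the hint can't leak into other queries:

```sql
BEGIN;
-- Nudge the planner away from sequential scans for this transaction only.
SET LOCAL enable_seqscan = off;
SELECT count(*) FROM events WHERE properties @> '{"type": "click"}';
COMMIT;
```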
This happens even in development, before going live for the first time, and way more often as you keep changing software. Even if you throw away the data every time, you still have to update the seeding scripts (in a relational db you have seeding + schema changes).
Anyway, what did you do? Did you keep using JSONB with that planner config setting or did you extract some data to ordinary columns?
As for the above story, that engineer was sort of on his way out at the time, so I used the above method to provide query hints as a short term fix. After he left, I was able to restructure the event data schema to make more use of columns. Some of the ancillary attributes that weren't used for row selection stayed as jsonb, but things like timestamp, event name, user id, etc. were moved to columns.
We've wanted to do this but the last I checked, Citus, the software we use to shard our postgres databases, isn't able to handle setting configs in a query.