We're in the process of switching to a more balanced schema (mentioned in this post) and the results have been pretty good so far.
Another win has been that the better stats make it possible to reliably get bitmap joins from the planner. Our configuration uses ~12 RAIDed EBS drives, so the I/O concurrency is really high and prefetching for a bitmap scan works particularly well.
1) Using jsonb to store user-generated forms and submissions to those forms. As an example, you can create a form with text inputs, checkboxes, etc., and others can submit responses with that form. I find that these forms and their submissions are best stored as jsonb because their contents are largely opaque (I don't care about the contents except where they are rendered on the client), their structure is highly dynamic, and their schema changes frequently.
2) As a specialized case of #1, applying filters to user-generated form submissions. jsonb supports containment operators (@> and <@, if I remember correctly), which make easy work of dynamically filtering form submissions on custom form fields, even for complex filter conditions.
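A containment filter of that shape might look like this (the table and field names here are hypothetical; `data` is the jsonb column holding a submission's answers):

```sql
-- Find submissions whose answers contain these key/value pairs.
-- @> asks: does the jsonb on the left contain the jsonb on the right?
SELECT id
FROM submissions
WHERE data @> '{"newsletter": true, "country": "DE"}';
```

Because @> checks containment rather than a single extracted field, the same query shape works no matter which custom fields a given form defines.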
3) Storing/munging/slicing relatively low-volume log data is fantastic with jsonb. This is always for admin/diagnostic reasons, so it's not as performance-critical, and the ability to group on and do subset operations on jsonb fields makes slicing your data really easy.
Using ->> (or ->) in a WHERE clause is generally a bad idea, and certainly a terrible idea without an explicit index. Use @> instead.
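For example (a hypothetical `events` table with a jsonb `properties` column), a GIN index supports @> directly, while the ->> form cannot use it:

```sql
-- A GIN index over the whole jsonb document supports containment (@>).
CREATE INDEX events_properties_idx ON events USING gin (properties);

-- Can use the GIN index:
SELECT * FROM events WHERE properties @> '{"status": "open"}';

-- Cannot use it; without a dedicated expression index this scans the table:
SELECT * FROM events WHERE properties ->> 'status' = 'open';
```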
I read a presentation titled "Powering Heap" by Dan Robinson, Lead Engineer at Heap, which contains interesting info about how you use PostgreSQL. [1]
At Heap, do you try to keep rows belonging to the same customer_id contiguous on disk, in order to minimize disk seeks?
If yes, how do you do it? Do you use something like pg_repack?
If no, don't you suffer from reading heap pages that contain only one or a few rows belonging to the requested customer_id?
[1] http://info.citusdata.com/rs/235-CNE-301/images/Powering_Hea...
Currently, maintaining the clustering has only been best-effort. We sort our data whenever we copy it from one location to another and the data comes in sorted by time, so it's fairly easy to maintain a high row correlation with time.
It doesn't mean there is no data model (schema), and it doesn't mean that the data model is flexible. It actually means you have a succession of distinct and undocumented schemas, updated on a haphazard, ad hoc basis, with no documentation or thought given to the change.
Every version of every app and every support program ever written then has to know about each and every historic variant of the data model that ever existed. This is a maintenance nightmare when your app is more than a few iterations old, and when you have several decoupled support systems trying to use the database.
With an overt schema, you are required to at least think about what you're doing and to do it in a centralized fashion, rather than slip changes in haphazardly in any app that ever touches the database, and you're required to ensure that the data already in the database actually conforms to the new schema. You won't have one app that puts the work email in "work-email" with a dash, and another that tries to use "work_email" with an underscore, for example.
I've seen otherwise...
Having learned that postgresql NULLs are cheap in storage (since they just use a bitmap) I stopped caring much about empty column proliferation.
I've also found that arbitrary user-defined keys are still better treated via a "known keys" table, the long-term visibility into data is invaluable, there's more join flexibility, and there's opportunity to assign metadata (e.g. type hints) for each user-defined key. At scale, these things make json columns look like technical debt.
The only use case I currently have for jsonb is when the absence of an element is meaningfully different from a null value, e.g. options structures that wouldn't otherwise benefit from being in normalized tables.
Is it a table with columns like user_defined_text_1, user_defined_text_2, ..., user_defined_text_10, user_defined_integer_1, user_defined_integer_2, ... user_defined_integer_n ?
Honest question - in what settings would many optional values be impractical or impossible? Is it purely space/performance constraints? If so, it doesn't sound like JSONB gives you wins in either of those cases.
1. You could have a table for every variant of the form
2. You could have a table with every field that's ever been used as a nullable field and add new columns every time a new version of the form appears
3. You could have a table with columns for the important fields and a JSONB column for the variable data
Of the three options, 3 seems the most elegant to me. The other two are basically allowing the ugliness of the source data to spew complexity all over your database.
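A sketch of option 3 (all names hypothetical): the stable, frequently queried fields are real columns, and everything form-specific goes into jsonb:

```sql
CREATE TABLE form_submissions (
    id         bigserial   PRIMARY KEY,
    form_id    bigint      NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now(),
    -- fields every form shares stay typed, constrained, and indexable
    email      text        NOT NULL,
    -- per-form variable data lands here
    fields     jsonb       NOT NULL DEFAULT '{}'
);
```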
Depending on your RDBMS this is not actually so bad. SQL Server has sparse column support which helps to make this sort of schema practical. It comes with some strings attached, however [1].
The latter is a sort of "inheritance" done in SQL, and I have used it with good results in the past.
One such example would be a transactions table, where the transaction info I care about lives in dedicated columns, but additional details from, say, PayPal or another payment processor are stored as JSON. There is little value in parsing that detailed information into a unified format, which would always carry "extra" processor-specific fields, and it avoids the need for a dedicated table per payment processor.
Another example would be logging, where some information is consistent but other information is mainly just stored. As an aside from a db server: I also like jsonl (line-terminated JSON, one record per line) in a gzip stream; it works very well.
There are other considerations as well; in some cases a document store such as Mongo or RethinkDB may be closer to what you ideally want, or for that matter you may need something closer to Cassandra.
Traditional SQL databases have column names that are stored separately from the table data, but in JSON, the keys are stored in the data itself.
I've worked with MongoDB systems that held 100 terabytes of data. At that scale, we had to re-write all of the keys so they were only a single character. When JSON is small, it is pleasant to have keys that make sense, such as:
username : mary
but as JSON gets big, this needs to change to:
u : mary
If you have 10 million records, the difference (in this trivial example) is 70 million characters, and if you have 100 different collections with an average of 20 keys each, then you are talking about 140 billion characters saved.
Most of the companies that I know, if they work with JSON document stores at scale (such as MongoDB) eventually take steps to make the keys smaller.
This may make sense if you have custom fields, i.e. the set of keys is user-defined and ever increasing.
Apart from that, there is no reason to do this. If it is all about avoiding having "too many" columns and/or "too many" null values, then I'd say: Don't worry. Just use as many columns as you need.
I would even go one step further and say: It is a common anti-pattern to introduce generic key-value stores (through a separate table, JSON structures or XML structures) without a compelling reason.
With JSONB I can receive a boolean element with five states: null, true, false, invalid (i.e. actually a string, number or array value), or simply nonexistent, and it's up to the application to deal with all such cases. You may have to treat the database as a source of potentially invalid data that must be sanitized.
With regular DB columns you can reasonably assume that a boolean not null column will a) exist for all records and b) return either true or false. You need only scan the database schema at application start to verify this.
On the other hand, if you're using JSONB for an options structure then the possibility of key nonexistence may actually be useful, since it implies "use system default". (Yes this can differ from the meaning of a NULL value, especially if you are merging options structures).
You might question whether it is wise to have a "boolean" key/value pair with four semantically distinct states. I can only say I have done this and I am not proud of it.
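Those states can at least be told apart in SQL (a hypothetical `settings` table with a jsonb `opts` column):

```sql
SELECT
    opts ? 'flag'                AS key_exists,  -- false: key nonexistent
    opts -> 'flag'               AS raw_value,   -- SQL NULL when key absent
    jsonb_typeof(opts -> 'flag') AS value_type   -- 'boolean', 'null', 'string', ...
FROM settings;
```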
This is an anti-pattern because you can't model "1:1" that way. Instead, it will be "1:0..1", and now you have some nasty corner cases when the first table has an entry whose counterpart in the second table is missing. Also, when using a column you now always have to think about which table it lives in - the first or the second?
I had to work with such a design in a real-world project and it was really annoying.
Taking the well-defined subset of searchable properties and making them columns, as described in the article, is really the best solution.
Actually, we store only 6 things inside the table and the rest inside the jsonb.
However, we still miss about 8 values that we use in a list view, which is slow, but a materialized view came to the rescue. We may pull them out into columns at some point; we still need to figure that out, since the jsonb data set is also the input to a hash that checks the validity of the data, which we use for change detection against other systems.
It is much better for data that is stored once and then queried.
Can't this be solved by introducing an expression index[1] for "record ->> 'value_2'"?
This would add a specialized index that will be used for all queries that have a filter like "WHERE record ->> 'value_2' = '0' AND ..." (note that ->> returns text, so you compare against a string or index a cast).
[1] https://www.postgresql.org/docs/current/static/indexes-expre...
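A sketch using the column names from the article (the table name `records` is assumed); indexing the cast keeps integer comparisons index-friendly:

```sql
-- Expression indexes need an extra set of parentheses around the expression.
CREATE INDEX records_value_2_idx ON records (((record ->> 'value_2')::int));

-- This predicate matches the indexed expression and can use the index:
SELECT * FROM records WHERE (record ->> 'value_2')::int = 0;
```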
In fact, this might make the query slower. If postgres thinks it is selecting a very small number of rows, it will prefer an index scan of some kind, but a full table scan will be faster if it's retrieving 1/8th of the table (at least, for small rows like these). So, you might get a slower row retrieval and the same explosively slow join.
In the worked example, adding expression indices for the integer values of value_1, value_2, and value_3 makes the JSONB solution only marginally less efficient than the full-column solution. On my computer, ~300ms instead of ~200ms.
(This is Postgres 9.5)
The storage is almost 2x; accessing attributes is more expensive than with tabular data even when the JSONB is indexed; the data can be dirty (the client can send extra attributes or invalid values for existing attributes, and since JSONB doesn't have any schema, PostgreSQL doesn't validate or cast the values); and, as the author mentioned, PostgreSQL statistics and indexes don't play nicely with JSONB.
Seems like a basic feature to improve space savings and processing speed.
Am I missing something about why this isn't done already, and automatically?
IMHO anything that is to be queried against regularly should be normalized into an actual column.
But, we had a situation where we had a "user_meta" equivalent, but wanted to support different data types (and even possibly nested data) using JSONB allowed for simple modelling of something like `{ "age": 1, "school": "blah", "something": { "in": "depth" } }` which isn't as simple using an extra "meta" table.
(Not to say it's the best thing to do (depending on the situation it might be better to have explicit columns for those fields) but it's an example of how it can be more powerful than just having an extra "meta" table.)
But as an alternative to things like serialized objects, I think it's definitely a huge win. You can do things like join a jsonb object property to its parent table, which wouldn't be possible with serialized objects.
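For instance (a hypothetical `orders` table whose `payload` jsonb holds an `items` array), each array element can be expanded and joined back to its parent row:

```sql
SELECT o.id, item ->> 'sku' AS sku
FROM orders o
CROSS JOIN LATERAL jsonb_array_elements(o.payload -> 'items') AS item;
```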
If you're developing an application in-house, this is not a big deal - you can make sure you have the right PostgreSQL version. If you're hosting the application on a shared database server, well, you're exactly in the same situation as with other software products.
This is mostly fixed in 9.5: http://blog.2ndquadrant.com/jsonb-and-postgresql-9-5-with-ev...
Protip: Use the planner config settings[1] (one of which is mentioned in this article) with SET LOCAL in a transaction if you're really sure the query planner is giving you guff. On more structured data that Postgres can calculate statistics on, let it do its magic.
[1]: https://www.postgresql.org/docs/current/static/runtime-confi...
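A minimal sketch of the pattern (the table name and setting are chosen for illustration); SET LOCAL reverts automatically at COMMIT or ROLLBACK, so the hint can't leak into other queries:

```sql
BEGIN;
-- Nudge the planner away from sequential scans for this transaction only.
SET LOCAL enable_seqscan = off;
SELECT count(*) FROM events WHERE properties @> '{"type": "click"}';
COMMIT;
```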
This happens even in development, before going live for the first time, and way more often as you keep changing software. Even if you throw away the data every time, you still have to update the seeding scripts (in a relational db you have seeding + schema changes).
Anyway, what did you do? Did you keep using JSONB with that planner config setting or did you extract some data to ordinary columns?
As for the above story, that engineer was sort of on his way out at the time, so I used the above method to provide query hints as a short term fix. After he left, I was able to restructure the event data schema to make more use of columns. Some of the ancillary attributes that weren't used for row selection stayed as jsonb, but things like timestamp, event name, user id, etc. were moved to columns.
We've wanted to do this but the last I checked, Citus, the software we use to shard our postgres databases, isn't able to handle setting configs in a query.