This is a standard story: A query ran for a long time without issue, and then, one day, some minor shift in your statistics happens, and now you have a major performance issue on your hands, without any real change on prod. No non-production environment can help you: either they don't have the same data, or they do but the statistics sampling was slightly different.
If that happens, you have to find an incantation that makes the planner understand what's going wrong. PostgreSQL has tools like extended statistics objects and, in this case, per-column statistics settings, but finding the right incantation can be black magic, and adding indexes/statistics/... can take hours, so trial and error is slow.
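For reference, a minimal sketch of those two knobs, with an invented table and columns: an extended statistics object teaches the planner about cross-column relationships, and a per-column statistics target raises the sampling detail on a skewed column.

```sql
-- Hypothetical schema: "addresses" with correlated city/zip columns.
CREATE STATISTICS addr_stats (dependencies, ndistinct)
    ON city, zip FROM addresses;

-- Per-column property: sample this column in more detail
-- than the default statistics target of 100.
ALTER TABLE addresses ALTER COLUMN zip SET STATISTICS 1000;

ANALYZE addresses;  -- nothing takes effect until the next ANALYZE
```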
Dumber databases have an edge here: Their performance is probably lower, but it is predictable.
Some RDBMS have mitigations, e.g. Oracle's optimizer plan stability allows you to make the plan unchangeable. It's a double-edged sword, of course: the plan won't get better when the data would allow it, but it won't get worse either.
> No non-production environment can help you: either they don't have the same data, or they do but the statistics sampling was slightly different.
GoCardless still has a massive Postgres database (10TB or thereabouts) and only managed to scale it by investing heavily in tooling that helps developers work with it safely.
One example is https://github.com/gocardless/draupnir, a tool to create instances of production datasets very quickly (just `eval $(draupnir new); psql` and you have a mini production in ~3s), so you could try things like adding indexes, tweaking the planner settings (`set enable_seqscan='off'`) and reshaping the data to see how your planner behaved.
I think it's very doable, though the planner still has blind spots. I had a side project to add linear correlation statistics to the planner that I abandoned when I stopped working with big Postgres databases, but that's an example of statistics that Postgres just doesn't track, and whose absence leads to these pathological edge cases.
I'd rather have the clever planner than not, though. I've a healthy appreciation for the heavy lifting Postgres can do for you.
But it's not all roses and rainbows. Draupnir seems cool, but it can't help you avoid the problem, only fix it faster.
At the core, there is a trade-off here: performance for predictability. You see the same thing in compilers, JITs and sometimes even processor cores. There is an optimizer in these things that works 99% of the time and makes things clearly better than human effort alone allows. But once in a while it guesses wrong, and you fall off a performance cliff.
Meanwhile, predictability is valuable in production, even to the point you might want to trade a serious dent in your performance for it.
Every change to our prod DB requires running EXPLAIN and EXPLAIN ANALYZE on some data to make sure the queries are doing the right thing (we use GCP Query Insights to watch for regressions [2]).
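As a rough illustration of that check (table and values invented), the thing to compare is the planner's row estimates against the actual counts:

```sql
-- EXPLAIN shows the chosen plan and estimates; ANALYZE actually runs it.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM events
WHERE account_id = 42
ORDER BY created_at DESC
LIMIT 50;
-- In the output, a "rows=" estimate wildly different from
-- "actual rows=" is the early-warning sign of a future bad plan.
```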
The vast majority of our queries are single index scans. I wish there was a database where we could fix the plan when our app is deployed. For the most part our schema is fairly denormalized, so we don't need very complex queries. The flexibility/power of SQL is really for debugging, analytics and other one-off queries.
Hot take: I wish there was a DB that didn't force SQL, at least for the application; instead you just told it what scan you wanted to do (basically embed the plan directly in the query you send). There could be a reporting mechanism if the DB detected a more efficient plan for the query or something. You could still have a SQL layer for your debug and one-off sessions.
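The pg_hint_plan extension gets part of the way there: hints embedded in a leading comment pin scan and join choices per query. A hypothetical example (table and index names invented):

```sql
/*+ IndexScan(orders orders_customer_id_idx) NestLoop(orders customers) */
SELECT o.id, o.amount
FROM orders o
JOIN customers c USING (customer_id)
WHERE c.region = 'EU';
```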
I would vastly prefer that predictability over occasional performance spikes or, in our case, a spike of transaction failures due to a predicate lock being grabbed for a whole table [1].
[1]: the default here is 32 rows (https://www.postgresql.org/docs/current/runtime-config-locks...)
[2]: https://cloud.google.com/sql/docs/postgres/using-query-insig...
I know what you mean, but I don't think that quite captures it. It's more like this: planners are built on a set of assumptions that are often pretty far from robust, but that nevertheless work adequately well in almost all cases, including many cases where the assumptions haven't been met!
The best example is the standard assumption that multiple conditions/columns are independent of each other: all optimizers make this assumption (some can be coaxed into recognizing specific exceptions). This is obviously not true much of the time, even with a well-normalized schema. Because: why would it be?
All kinds of correlations naturally appear in real data. It's just that it mostly doesn't cause huge problems, for messy reasons that can't quite be pinned down. You have to get unlucky: the correlations usually have to be very high, and the planner has to make completely the wrong inference for the actual query that you ran (not some hypothetical other query). The planner only has to have approximately the right idea to discover the cheapest plan. And in many cases the planner doesn't even have to discover the cheapest plan; there may be quite a few adequate ones (it's really hard to generalize, but that's often true).
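A classic worked example of that independence assumption, with numbers invented for illustration:

```sql
-- Suppose 10% of rows have make = 'Honda' and 10% have model = 'Civic'.
-- Multiplying selectivities, the planner estimates 0.10 * 0.10 = 1% of rows.
-- But every Civic is a Honda, so the true selectivity is 10%:
-- a 10x underestimate, which can tip the planner into the wrong join or scan.
EXPLAIN
SELECT * FROM cars
WHERE make = 'Honda' AND model = 'Civic';
```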
Overall, the fact that cost-based optimizers work as well as they do seems quite surprising to me.
That's simply not true, it's just less noticeable. Because even if your query plan is not changing, your data is. There will always be some point where your data grows and a reasonable planner (whether you or your database) has to adapt to that as it grows. For example, if a small lookup table grows enough, it stops being faster to do a full table scan on that table, and it becomes reasonable to do an index lookup. If your plan never changes, your performance gets worse. You may argue that fixed query plans are more predictable, but they are not objectively better.
I haven't used it yet, but it sounds like that lets you have your cake and eat it too.
This is the precise problem I’m working on solving. See the pg_plan_guarantee extension.
I don't like the interface, however. Because you have to wrap the query with custom markers ($pgpg$), you can't use it with anything that programmatically generates the query, like an ORM.
I'd prefer an interface where you have a table that maps the query (hash?) to a plan. Then create a stored procedure, e.g. nail_plan('SELECT blah blah blah'), that inserts a record into that table. You could then back up and restore plans, easily query which plans are guaranteed, and maybe even migrate plans between dev and prod. The table could also mark which plans are now invalid.
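Something like this sketch, where every object is hypothetical (nothing here exists in pg_plan_guarantee today):

```sql
-- A table mapping query fingerprints to the plan we want to pin.
CREATE TABLE pinned_plans (
    query_hash  bigint  PRIMARY KEY,  -- e.g. the query's fingerprint/queryid
    query_text  text    NOT NULL,
    plan        text    NOT NULL,     -- serialized plan to force
    valid       boolean NOT NULL DEFAULT true
);

-- nail_plan('SELECT ...') would plan the statement once and insert a row;
-- plans invalidated by a dropped index or schema change get valid = false.
SELECT query_text, valid FROM pinned_plans;  -- audit what's currently pinned
```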
Upgrading major versions and huge surges of writes are often the catalysts for crossing performance cliffs, IME. The first can be planned for, and load testing can help get ahead of the second.
If you like this content, there's a bunch more on my blog that you'll probably like too:
https://blog.lawrencejones.dev/
While I continue to work with Postgres, I'm enjoying the life of an early start-up where your default is "not much data", though it does make you unlearn a lot of the habits developed when dealing with multi-terabyte databases!
> large, heterogeneous datasets. When all the rows look similar there will be fewer candidate plans and less chance of a disastrous mis-step.
I'm no longer at the company so can't fix it, apologies!