One of the reasons I don't like ORMs is that I can't see the underlying query and truly optimize a service. That may be fine for a new service where performance isn't crucial, but once it needs to scale, you have to put on your engineering hat, get your hands dirty, and optimize queries.
You'll find you need to rewrite queries so there isn't complex nesting in the WHERE clause, and flatten your logic so that the SQL optimizer can use your indexes. You may need to put SELECT statements within SELECT statements, where the innermost SELECT uses the indexes and the outer queries work on the inner query's result, which is much smaller than the whole table.
I feel that SQL aimed to be Python and became x86 assembly instead. It's no longer a simple "just works" query language the moment you have to worry about predicate flattening, join decomposition, CTEs that introduce optimization barriers, and "IN()" being faster than equivalent "JOINs".
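To make the "inner SELECT feeds the outer query" rewrite concrete, here's a sketch using SQLite via Python's stdlib. The `orders`/`customers` schema, the data, and the selectivity are all made up for illustration; the point is only that the JOIN and IN() formulations answer the same question, while the IN() form hands the planner a small pre-computed set to probe the index with.

```python
import sqlite3

# Hypothetical schema for illustration only: orders indexed on customer_id.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    CREATE INDEX idx_orders_customer ON orders (customer_id);
""")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(i, "EU" if i % 2 else "US") for i in range(1, 101)])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(i, (i % 100) + 1, float(i)) for i in range(1, 1001)])

# JOIN formulation.
join_q = """SELECT o.id FROM orders o
            JOIN customers c ON c.id = o.customer_id
            WHERE c.region = 'EU'"""

# Equivalent IN() formulation: the inner SELECT produces a small set,
# and the outer query probes the index on customer_id with it.
in_q = """SELECT o.id FROM orders o
          WHERE o.customer_id IN (SELECT id FROM customers WHERE region = 'EU')"""

join_rows = sorted(conn.execute(join_q).fetchall())
in_rows = sorted(conn.execute(in_q).fetchall())
print(join_rows == in_rows)  # same answer, potentially different plans
```

Whether the IN() form is actually faster depends on the engine and the data; the only way to know is to look at the plan, which is exactly the tedium being complained about.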
As a result, I started a project called GraphQL compiler that lets you write read-only, database-agnostic queries: https://graphql-compiler.readthedocs.io/ https://github.com/kensho-technologies/graphql-compiler
The core idea of the project is to get the convenience of specifying "what question I want answered" without the inconvenience of "how is the answer computed, with which specific set of queries, and where did the data come from?" -- unless you want to peek under the hood, of course. All the visibility into the nitty-gritty details is available on demand, but without the tedium of hand-optimizing queries and knowing all the "magic" ways in which queries get faster or slower on each individual kind of database.
So...exactly like SQL, then?
“The GraphQL compiler turns read-only queries written in GraphQL syntax into different query languages” is useful for GraphQL ‘fans’, but I don’t see how that solves the problem of executing queries efficiently.
Designing a user friendly query language for relational data isn’t the hard part. Executing such queries efficiently is.
For SQL, there’s half a century of research on that. This paper is part of it, and indicates that, at this moment in time, effort is better spent on methods for keeping statistics on data up to date than on making cost models more fine grained.
I thought most of them had some feature where you could dump the query before it gets sent to the DB.
Stuff like this: https://stackoverflow.com/questions/1412863/how-do-i-view-th...
If you're looking at the generated SQL, I would rather just use the SQL directly in my code. There are probably features in ORMs that let you write raw SQL and tell it how to map the result to an object, but I haven't used an ORM in a while.
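For what it's worth, that "dump the query before it hits the DB" idea exists even without an ORM. A minimal sketch using Python's stdlib `sqlite3` (the table is a throwaway example): `set_trace_callback` hands you every statement as the database executes it, which is the same hook most ORMs expose through a logging or echo switch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")

sent = []  # capture every statement at the moment it is executed
conn.set_trace_callback(sent.append)

conn.execute("INSERT INTO t VALUES (1)")
conn.execute("SELECT x FROM t")

print(sent)  # the raw SQL, as the database saw it
```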
The vast majority of your queries will not fall into the category of "bottlenecks that need to be optimized", though, and you (and your probably less experienced team) will benefit massively from the less error-prone and more extensible nature of ORMs. (I never again want to deal with an attempt at SQL code reuse that has grown into a string-formatted, quadruple-manifestation, triple-escaped nightmare.)
A good ORM will also ease the transition into more manual SQL, so that you can still retain the benefits of e.g. uniform abstract objects app-side.
You can. I mentioned EXPLAIN in my comment.
And the query planner isn’t a black box. Once you read the documentation on how the engine determines its order of operations, you can start thinking on the same plane as the query engine. You can infer how a query will use indices and how the WHERE clause will be applied.
Admittedly it’s not as easy as using an ORM, but if you’re a SQL expert you can make queries far more optimized. And you’ll never internalize how a SQL engine reads your queries unless you do it yourself.
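As a sketch of that workflow (using SQLite via Python's stdlib rather than Postgres, with a made-up `users` table): you can ask the planner for its plan up front and confirm an index would be used, without running the query at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT);
    CREATE INDEX idx_users_email ON users (email);
""")

# Ask the planner how it would run the query, without executing it.
# In Postgres the equivalent is EXPLAIN / EXPLAIN ANALYZE.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM users WHERE email = ?", ("a@b.c",)
).fetchall()
print(plan)  # the detail column names idx_users_email for the equality probe
```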
I’m talking about huge tables that are hit many times a second, where you need to start thinking like a Formula One team, being creative with queries to shave off hundreds of milliseconds.
Until it's not, and then I want hints to save my ass. And they're not really hints: I want to TELL the f'ing computer what to do, because I know better than the optimizer, period.
I was so surprised to find out PG doesn't support hints. I don't think I will ever be able to move anything serious onto it until it does; I'm just not going to take that kind of risk.
I have played the whole game of rewriting a query with barrier tricks to try and convince the optimizer what to do. No thanks: give me hints and I will tell it exactly what to do and when, thanks.
Oftentimes PG won’t bother with an index for a variety of reasons (sequential scans can be legitimately faster in some scenarios), especially when the number of rows is small.
It's not just about index usage; it's also about which type of join (loop, hash, or merge) is used, and the join order.
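PG aside, some engines do ship a documented escape hatch for join order. SQLite, for instance, documents that the CROSS JOIN operator prevents the planner from reordering the join, so the left table is always the outer loop, which is effectively a join-order hint. A minimal sketch with made-up `orders`/`customers` tables, again using Python's stdlib:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    CREATE TABLE customers (id INTEGER PRIMARY KEY);
""")

def first_scanned(sql):
    # The first row of the plan is the outer table of the nested loop.
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()[0][-1]

# CROSS JOIN forbids reordering, so swapping the operand order
# swaps which table the planner scans on the outside.
outer_first = first_scanned(
    "SELECT * FROM orders CROSS JOIN customers WHERE customers.id = orders.customer_id")
outer_second = first_scanned(
    "SELECT * FROM customers CROSS JOIN orders WHERE customers.id = orders.customer_id")
print(outer_first)
print(outer_second)
```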
Can anyone comment on how relevant this is with the enhanced statistics types in Postgres 10, 11, 12?
One of the most useful areas for future improvement is making plans more robust against misestimations during execution, for example by using techniques like role-reversal during hash joins, or Hellerstein's "Eddies".
So, the paper definitely talks about how independent column statistics are a problem with big tables in the default stats configuration.
...But the option of creating correlated, non-independent column statistics did not exist in PG until after this paper was published. Which was my point.
In my experience, flat out increasing statistics sample rates fixes 80%+ of the problems in this paper, with basically no downsides. (You can push that computation to downtime when no-one cares.)
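For a concrete (if toy) picture of what those statistics are, here's the SQLite analogue via Python's stdlib: ANALYZE scans the data and records per-index row counts in `sqlite_stat1`, which the planner then consults for selectivity estimates. In Postgres the comparable knobs are `default_statistics_target` and `ALTER TABLE ... ALTER COLUMN ... SET STATISTICS`. The table and data below are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (id INTEGER PRIMARY KEY, kind TEXT);
    CREATE INDEX idx_events_kind ON events (kind);
""")
conn.executemany("INSERT INTO events (kind) VALUES (?)",
                 [("click" if i % 10 else "purchase",) for i in range(1000)])

# ANALYZE gathers the statistics the planner uses for its estimates.
conn.execute("ANALYZE")
stats = conn.execute("SELECT * FROM sqlite_stat1").fetchall()
print(stats)  # (table, index, "rows avg_rows_per_key") entries
```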
> ...the most important statistic for join estimation in PostgreSQL is the number of distinct values. These statistics are estimated from a fixed-sized sample, and we have observed severe underestimates for large tables.
Live statistics, incrementally updated on DML execution, are a key feature of a good query optimizer. As a zero-administration RDBMS, SQL Anywhere gained a reputation a decade ago for having a best-of-breed query optimizer [1]; I'm curious whether this still holds true.
In the last decade, the importance of OLAP queries in row stores has diminished due to the superiority of column stores. I'd be interested in a comparison of the Citus query optimizer vs. say Presto.
[1] https://www.student.cs.uwaterloo.ca/~cs448/W11/cs448_Paulley...
Another post in this thread mentions adaptive query planning and mistakenly implies that the GEQO is a module for this. My hands have been itching to experiment with some improvements to the GEQO, specifically by improving the genetic algorithms used. When there are many similar queries (i.e., in the adaptive query planning setting), one could also use reinforcement learning to improve query planning over time.
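To illustrate the shape of such a search (a toy sketch only: this is not PG's actual GEQO algorithm, and the cost model, selectivity, and table sizes are invented): a genetic search over left-deep join orders with truncation selection and swap mutation.

```python
import random

# Invented row counts and a fixed per-join selectivity; the "cost" of a
# left-deep join order is the total size of its intermediate results.
SIZES = {"a": 1000, "b": 10, "c": 500, "d": 5, "e": 200}
SELECTIVITY = 0.01

def cost(order):
    rows, total = SIZES[order[0]], 0.0
    for t in order[1:]:
        rows = rows * SIZES[t] * SELECTIVITY  # intermediate result size
        total += rows                         # pay for materializing it
    return total

def mutate(order):
    # Swap two positions in the join order.
    i, j = random.sample(range(len(order)), 2)
    child = list(order)
    child[i], child[j] = child[j], child[i]
    return tuple(child)

def genetic_join_order(tables, pop_size=30, generations=200, seed=0):
    random.seed(seed)
    pop = [tuple(random.sample(tables, len(tables))) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=cost)
        survivors = pop[: pop_size // 2]  # truncation selection
        pop = survivors + [mutate(random.choice(survivors))
                           for _ in range(pop_size - len(survivors))]
    return min(pop, key=cost)

best = genetic_join_order(list(SIZES))
print(best, cost(best))
```

The real GEQO uses edge-recombination crossover and PG's actual cost estimates rather than this toy model, but the population-of-permutations skeleton is the same.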
Optimizers are weird.