undefined | Better HN

0 pointsdarksaints5y ago0 comments

If you know all of the quirks of the various databases, why aren't you hacking on their optimizers? Why write a compiler that knows what's slow and what's not when you can just fix what's slow?

0 comments

obi1kenobi5y ago

It's an "and" rather than an "either-or" :)

When the database is open-source, and I spot something that's broken that I know how to fix, I try to fix it. Here's a fix for a severe database query planner correctness bug I contributed to an open-source database called OrientDB: https://github.com/orientechnologies/orientdb/pull/7015

Unfortunately, Microsoft SQL Server, Oracle, and many other databases are not open-source, and I can't hack on their query planners. And even if they were, SQL is an absolutely massive language (the complete spec is 100,000+ pages). The GraphQL compiler query language is tiny in comparison, the spec is maybe 10 pages: https://graphql-compiler.readthedocs.io/en/latest/language_s...

It's a lot easier to intelligently map a small language to a big one than it is to optimize the big language outright.

In a sense, SQL is just not designed to be easy to optimize — it's too broad, and there are too many equivalent-ish ways of doing the same thing. This is why even after incredibly smart people cumulatively spent engineer-millennia on the query execution and optimization systems in SQL databases, we still keep having issues and there are still plenty of areas for improvement.

More info and more concrete examples of "why not just write SQL" in my upcoming blog post!

petergeoghegan5y ago

> In a sense, SQL is just not designed to be easy to optimize — it's too broad, and there are too many equivalent-ish ways of doing the same thing. This is why even after incredibly smart people cumulatively spent engineer-millennia on the query execution and optimization systems in SQL databases, we still keep having issues and there are still plenty of areas for improvement.

The main reason is the inherent difficulty of cardinality estimation, as the paper says.

Not every optimization is worth having. There is typically a distributed cost, paid in extra planner cycles for queries that don't benefit from the optimization. This is one of the main reasons why it's hard to contribute new optimizations to the Postgres planner. Naturally, it's possible that a marginal optimization will be incredibly important to one particular query or user. It's a value judgement in each case.

Frankly, I find the suggestion that SQL is not designed to be easy to optimize baffling.

obi1kenobi5y ago

I think we agree more than may seem apparent at first glance. In a sense, you are also making the same point I was trying and probably failed to make. Please bear with me as I give it another shot.

The difficulty of cardinality estimation is a function of the expressiveness of the language. Imagine a new query language, SQL2, that only has the SELECT and FROM keywords -- no WHERE, no JOIN, nothing else. Cardinality estimation in SQL2 is trivial: just store the counts for each table, and you can estimate everything trivially. Optimal query plans are trivial by extension as well.

Now let's add the WHERE keyword and a few operators to this SQL2. Cardinality estimation and good query planning got much harder now! For example, if the WHERE predicate touches two columns, we need to know about correlations between the two columns, or we might make incorrect estimates and therefore get worse plans. And since the plan space got bigger, we spend more cycles on planning. If we continue to add more features to SQL2 to bring it to parity with SQL proper, all these problems get harder as we go.

The language semantics behind GraphQL compiler aim to get sufficient expressiveness for many (hopefully, most) use cases, while limiting the scope so that the problems of cardinality estimation and good planning don't become too hard to solve effectively. In comparison, SQL is significantly more expressive, and as a result also much more difficult to execute and optimize.

j / k navigate · click thread line to collapse

0 comments

obi1kenobi5y ago

It's an "and" rather than an "either-or" :)

It's a lot easier to intelligently map a small language to a big one than it is to optimize the big language outright.

More info and more concrete examples of "why not just write SQL" in my upcoming blog post!

petergeoghegan5y ago

The main reason is the inherent difficulty of cardinality estimation, as the paper says.

Frankly, I find the suggestion that SQL is not designed to be easy to optimize baffling.

obi1kenobi5y ago

I think we agree more than may seem apparent at first glance. In a sense, you are also making the same point I was trying and probably failed to make. Please bear with me as I give it another shot.

j / k navigate · click thread line to collapse