One of the reasons I don't like ORMs is that I can't see the underlying query and truly optimize a service. That may be fine for a new service where performance isn't crucial, but once it needs to scale, you have to put on your engineering hat, get your hands dirty, and optimize queries.
You'll find you need to rewrite queries so there isn't complex nesting in the WHERE clause, and flatten your logic so that the SQL optimizer can use your indexes. You may need to put SELECT statements within SELECT statements, where the innermost SELECT uses the indexes and the outer queries work on the inner query's result, which is much smaller than the whole table.
I feel that SQL aimed to be Python and became x86 assembly instead. It's no longer a simple "just works" query language the moment you have to worry about predicate flattening, join decomposition, CTEs that introduce optimization barriers, and "IN()" being faster than equivalent "JOINs".
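To make the "inner SELECT feeds the outer query" rewrite concrete, here's a sketch using SQLite via Python's stdlib. The `orders`/`customers` schema, the data, and the selectivity are all made up for illustration; the point is only that the JOIN and IN() formulations answer the same question, while the IN() form hands the planner a small pre-computed set to probe the index with.

```python
import sqlite3

# Hypothetical schema for illustration only: orders indexed on customer_id.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    CREATE INDEX idx_orders_customer ON orders (customer_id);
""")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(i, "EU" if i % 2 else "US") for i in range(1, 101)])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(i, (i % 100) + 1, float(i)) for i in range(1, 1001)])

# JOIN formulation.
join_q = """SELECT o.id FROM orders o
            JOIN customers c ON c.id = o.customer_id
            WHERE c.region = 'EU'"""

# Equivalent IN() formulation: the inner SELECT produces a small set,
# and the outer query probes the index on customer_id with it.
in_q = """SELECT o.id FROM orders o
          WHERE o.customer_id IN (SELECT id FROM customers WHERE region = 'EU')"""

join_rows = sorted(conn.execute(join_q).fetchall())
in_rows = sorted(conn.execute(in_q).fetchall())
print(join_rows == in_rows)  # same answer, potentially different plans
```

Whether the IN() form is actually faster depends on the engine and the data; the only way to know is to look at the plan, which is exactly the tedium being complained about.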
As a result, I started a project called GraphQL compiler that lets you write read-only, database-agnostic queries: https://graphql-compiler.readthedocs.io/ https://github.com/kensho-technologies/graphql-compiler
The core idea of the project is to get the convenience of specifying "what question I want answered" without the inconvenience of "how is the answer computed, with which specific set of queries, and where did the data come from?" -- unless you want to peek under the hood, of course. All the visibility into the nitty-gritty details is available on demand, but without the tedium of hand-optimizing queries and knowing all the "magic" ways in which queries get faster or slower on each individual kind of database.
So...exactly like SQL, then?
“The GraphQL compiler turns read-only queries written in GraphQL syntax into different query languages” is useful for GraphQL ‘fans’, but I don’t see how that solves the problem of executing queries efficiently.
Designing a user friendly query language for relational data isn’t the hard part. Executing such queries efficiently is.
For SQL, there’s half a century of research on that. This paper is part of it, and indicates that, at this moment in time, effort is better spent on methods for keeping statistics on data up to date than on making cost models more fine grained.
I thought most of them had some feature where you could dump the query before it gets sent to the DB.
Stuff like this: https://stackoverflow.com/questions/1412863/how-do-i-view-th...
If you're looking at the generated SQL, I would rather just use the SQL directly in my code. There are probably features in ORMs that let you write raw SQL and tell it how to map the result to an object, but I haven't used an ORM in a while.
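For what it's worth, that "dump the query before it hits the DB" idea exists even without an ORM. A minimal sketch using Python's stdlib `sqlite3` (the table is a throwaway example): `set_trace_callback` hands you every statement as the database executes it, which is the same hook most ORMs expose through a logging or echo switch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")

sent = []  # capture every statement at the moment it is executed
conn.set_trace_callback(sent.append)

conn.execute("INSERT INTO t VALUES (1)")
conn.execute("SELECT x FROM t")

print(sent)  # the raw SQL, as the database saw it
```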
The vast majority of your queries will not fall into the category of "bottlenecks that need to be optimized", though, and you (and your probably less experienced team) will benefit massively from the less error-prone and more extensible nature of ORMs. (I never again want to deal with an attempt at SQL code reuse that has grown into a string-formatted, quadruple-manifestation, triple-escaped nightmare.)
A good ORM will also ease the transition into more manual SQL, so that you can still retain the benefits of e.g. uniform abstract objects app-side.
You can. I mentioned EXPLAIN in my comment.
And the query planner isn’t a black box. Once you read the documentation on how the engine determines its order of operations, you can start thinking on the same plane as the query engine. You can infer how a query will use indices and how the WHERE clause will be applied.
Admittedly it’s not as easy as using an ORM, but if you’re a SQL expert you can make queries far more optimized. And you’ll never internalize how a SQL engine reads your queries unless you do it yourself.
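As a sketch of that workflow (using SQLite via Python's stdlib rather than Postgres, with a made-up `users` table): you can ask the planner for its plan up front and confirm an index would be used, without running the query at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT);
    CREATE INDEX idx_users_email ON users (email);
""")

# Ask the planner how it would run the query, without executing it.
# In Postgres the equivalent is EXPLAIN / EXPLAIN ANALYZE.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM users WHERE email = ?", ("a@b.c",)
).fetchall()
print(plan)  # the detail column names idx_users_email for the equality probe
```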
I’m talking about huge tables that are hit many times a second, where you need to start thinking like a Formula One team, being creative with queries to shave off hundreds of milliseconds.
Until it's not, and then I want hints to save my ass. And they're not really hints: I want to TELL the f'ing computer what to do, because I know better than the optimizer, period.
I was so surprised to find out PG doesn't support hints. I don't think I will ever be able to move anything serious onto it until it does; I'm just not going to take that kind of risk.
I have played the whole game of rewriting a query with barrier tricks to try and convince the optimizer what to do. No thanks: give me hints and I will tell it exactly what to do and when, thanks.
Oftentimes PG won’t bother with an index for a variety of reasons (sequential scans can be legitimately faster in some scenarios), especially when the number of rows is small.
It's not just about index usage; it's also about which type of join (loop, hash, or merge) is used, and the join order.
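PG aside, some engines do ship a documented escape hatch for join order. SQLite, for instance, documents that the CROSS JOIN operator prevents the planner from reordering the join, so the left table is always the outer loop, which is effectively a join-order hint. A minimal sketch with made-up `orders`/`customers` tables, again using Python's stdlib:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    CREATE TABLE customers (id INTEGER PRIMARY KEY);
""")

def first_scanned(sql):
    # The first row of the plan is the outer table of the nested loop.
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()[0][-1]

# CROSS JOIN forbids reordering, so swapping the operand order
# swaps which table the planner scans on the outside.
outer_first = first_scanned(
    "SELECT * FROM orders CROSS JOIN customers WHERE customers.id = orders.customer_id")
outer_second = first_scanned(
    "SELECT * FROM customers CROSS JOIN orders WHERE customers.id = orders.customer_id")
print(outer_first)
print(outer_second)
```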
Can anyone comment on how relevant this is with the enhanced statistics types in Postgres 10, 11, 12?
One of the most useful areas for future improvement is making plans more robust against misestimations during execution, for example by using techniques like role-reversal during hash joins, or Hellerstein's "Eddies".
So, the paper definitely talks about how independent column statistics are a problem with big tables in the default stats configuration.
...But the option of creating correlated, non-independent column statistics did not exist in PG until after this paper was published. Which was my point.
In my experience, flat out increasing statistics sample rates fixes 80%+ of the problems in this paper, with basically no downsides. (You can push that computation to downtime when no-one cares.)
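For a concrete (if toy) picture of what those statistics are, here's the SQLite analogue via Python's stdlib: ANALYZE scans the data and records per-index row counts in `sqlite_stat1`, which the planner then consults for selectivity estimates. In Postgres the comparable knobs are `default_statistics_target` and `ALTER TABLE ... ALTER COLUMN ... SET STATISTICS`. The table and data below are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (id INTEGER PRIMARY KEY, kind TEXT);
    CREATE INDEX idx_events_kind ON events (kind);
""")
conn.executemany("INSERT INTO events (kind) VALUES (?)",
                 [("click" if i % 10 else "purchase",) for i in range(1000)])

# ANALYZE gathers the statistics the planner uses for its estimates.
conn.execute("ANALYZE")
stats = conn.execute("SELECT * FROM sqlite_stat1").fetchall()
print(stats)  # (table, index, "rows avg_rows_per_key") entries
```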
> ...the most important statistic for join estimation in PostgreSQL is the number of distinct values. These statistics are estimated from a fixed-sized sample, and we have observed severe underestimates for large tables.
Live statistics, incrementally updated on DML execution, are a key feature of a good query optimizer. As a zero-administration RDBMS, SQL Anywhere gained a reputation a decade ago for having a best-of-breed query optimizer [1]; I'm curious whether this still holds true.
In the last decade, the importance of OLAP queries in row stores has diminished due to the superiority of column stores. I'd be interested in a comparison of the Citus query optimizer vs. say Presto.
[1] https://www.student.cs.uwaterloo.ca/~cs448/W11/cs448_Paulley...
Another post in this thread mentions adaptive query planning and mistakenly implies that the GEQO is a module for this. My hands have been itching to experiment with some improvements to the GEQO, specifically by improving the genetic algorithms used. When there are many similar queries (i.e., in the adaptive query planning setting), one could also use reinforcement learning to improve query planning over time.
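To illustrate the shape of such a search (a toy sketch only: this is not PG's actual GEQO algorithm, and the cost model, selectivity, and table sizes are invented): a genetic search over left-deep join orders with truncation selection and swap mutation.

```python
import random

# Invented row counts and a fixed per-join selectivity; the "cost" of a
# left-deep join order is the total size of its intermediate results.
SIZES = {"a": 1000, "b": 10, "c": 500, "d": 5, "e": 200}
SELECTIVITY = 0.01

def cost(order):
    rows, total = SIZES[order[0]], 0.0
    for t in order[1:]:
        rows = rows * SIZES[t] * SELECTIVITY  # intermediate result size
        total += rows                         # pay for materializing it
    return total

def mutate(order):
    # Swap two positions in the join order.
    i, j = random.sample(range(len(order)), 2)
    child = list(order)
    child[i], child[j] = child[j], child[i]
    return tuple(child)

def genetic_join_order(tables, pop_size=30, generations=200, seed=0):
    random.seed(seed)
    pop = [tuple(random.sample(tables, len(tables))) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=cost)
        survivors = pop[: pop_size // 2]  # truncation selection
        pop = survivors + [mutate(random.choice(survivors))
                           for _ in range(pop_size - len(survivors))]
    return min(pop, key=cost)

best = genetic_join_order(list(SIZES))
print(best, cost(best))
```

The real GEQO uses edge-recombination crossover and PG's actual cost estimates rather than this toy model, but the population-of-permutations skeleton is the same.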
Optimizers are weird.