SELECT a.id, COUNT(b.*) FROM a JOIN b ON b.a_id = a.id GROUP BY a.id
is not permitted in Postgres. Sure, I could just use COUNT(b.a_id), since that's what I join on, but a more complicated example might not allow for that. For instance, if it were a virtual table.
SELECT a.id, COUNT(b.*), COUNT(c.*)
FROM a JOIN b ON b.a_id = a.id JOIN c ON c.a_id = a.id
GROUP BY a.id
I want to know how many occurrences a_id has in both table b and c. Again, in this simple example I could just count b.a_id and c.a_id, respectively, but imagine if b and c were complex virtual tables:

JOIN (SELECT NULL AS foo, 1 AS bar
      UNION SELECT 1 AS foo, NULL AS bar) b ON b.foo = a.id OR b.bar = a.id
This would be useful when we are aggregating data together where, essentially, there are two ways to join the data with the main table, and both columns can be NULL. Of course, in this example you could count by going COUNT(b.foo) + COUNT(b.bar), but that's a bit awkward; or you could count a column in table b that you know is never NULL. But what if you don't know of one? And still have table c next to it?
Yes, in all cases there would be a way out. In the extreme case, you could wrap it in a virtual table where you add a column that is always 0 (not NULL), so you can count on that. It would just be neat if b.* were possible.
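For what it's worth, that wrapping workaround might look something like this (a sketch; `b_src`, `foo`, and `bar` stand in for the real underlying table and join columns):

```sql
-- Wrap b in a derived table that adds a constant, never-NULL marker
-- column, then count that instead of b.*.
SELECT a.id, COUNT(b.one) AS b_matches
FROM a
LEFT JOIN (
    SELECT foo, bar, 0 AS one   -- 0 is arbitrary; it just can't be NULL
    FROM b_src
) b ON b.foo = a.id OR b.bar = a.id
GROUP BY a.id;
```

COUNT(b.one) only counts rows where a match exists, because the NULL-extended rows from the LEFT JOIN produce NULL in `one`.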
SELECT a.id, count() FROM ...
On a related note: is the MCSE the gold standard for SQL education? I've been looking for a way to brush up and formalize my SQL skills.
for reference: https://sudonull.com/posts/998-Important-changes-in-the-CTE-...
Generally speaking, some clarification would be helpful!
Before Postgres 12, CTEs were always materialized, so you did not get any of the query-optimization benefits of CTEs acting like inline subqueries.
As of Postgres 12, a CTE defaults to NOT MATERIALIZED if it is non-recursive, has no side effects, and is only referenced once; otherwise it defaults to MATERIALIZED. You can override this via MATERIALIZED or NOT MATERIALIZED when defining the CTE.
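For illustration, the override syntax looks like this (a sketch, assuming a `sale` table with a `created_by_id` column):

```sql
-- Postgres 12+: force the old always-materialize behavior
WITH s AS MATERIALIZED (
    SELECT * FROM sale
)
SELECT * FROM s WHERE created_by_id = 1;

-- Or force inlining, even when the CTE is referenced more than once
WITH s AS NOT MATERIALIZED (
    SELECT * FROM sale
)
SELECT * FROM s WHERE created_by_id = 1;
```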
Their example is showing that you can let Postgres (before 12) optimize a CTE for you by writing it as an inline subquery instead of a CTE:
SELECT *
FROM (
SELECT *
FROM sale
) AS inlined
WHERE created_by_id = 1
But with Postgres 12, their "don't" example would result in an index scan without refactoring to the "do" example. Basically, their advice on do vs. don't applies before Postgres 12.

https://www.postgresql.org/docs/12/queries-with.html is pretty thorough on this
I seriously take issue with "Reference Column Position in GROUP BY and ORDER BY", though. If it were restricted to ad-hoc (AKA messing-about) queries I'd be fine with it, but it won't be. Just don't do it.
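The positional style in question looks like this (a sketch, reusing the account/cust_id example from the thread):

```sql
-- Ordinal references: fragile, since reordering the select list
-- silently changes the grouping and the sort.
SELECT cust_id, SUM(avail_balance)
FROM account
GROUP BY 1
ORDER BY 2;
```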
# select cust_id as c, sum(avail_balance) as b from account group by c order by b;

Not indexing: most often, tables are either not indexed or poorly indexed.
Implicit conversion can generate a lot of I/O, lead to poor performance, or just prevent indexes from being used.
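A classic case looks something like this (a sketch; the `orders` table and varchar `order_ref` column are assumptions for illustration):

```sql
-- order_ref is VARCHAR. Comparing it to an integer literal forces an
-- implicit conversion of the column on every row, so an index on
-- order_ref can't be used for a seek:
SELECT * FROM orders WHERE order_ref = 12345;

-- Matching the column's type keeps the index usable:
SELECT * FROM orders WHERE order_ref = '12345';
```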
SQL functions: sorry, but they are most often crap and useless; better to inline the logic or use a TVF. And no, it's not code-logic duplication.
READ UNCOMMITTED: avoid it, unless you enjoy not reading rows, reading rows multiple times, or reading half of a value (page splits and/or LOB values).
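For reference, this is the isolation level (and its SQL Server table-hint shorthand) being warned against; a sketch with a hypothetical `orders` table:

```sql
-- Session-wide dirty reads:
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

-- Per-table equivalent via a hint (SQL Server):
SELECT o.id, o.total
FROM orders AS o WITH (NOLOCK);  -- may miss rows, double-read rows,
                                 -- or see torn LOB values mid-write
```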
Functions have helped me tremendously in SQL server, but you do have to know the issues, which can be taken advantage of to some degree.
Code reuse is the obvious use case, but the lack of inlining before SQL Server 2019 meant you could get worse performance than with a hand-inlined CASE expression or whatever. Hopefully, now that functions can be inlined in 2019, this will be a non-issue going forward.
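A sketch of the kind of scalar UDF that pays this penalty (names are hypothetical):

```sql
-- Before SQL Server 2019, a scalar UDF like this runs row-by-row and
-- inhibits parallelism; 2019's scalar UDF inlining can fold it into
-- the calling query's plan instead.
CREATE FUNCTION dbo.FullName (@first nvarchar(50), @last nvarchar(50))
RETURNS nvarchar(101)
AS
BEGIN
    RETURN @first + N' ' + @last;
END;
GO

SELECT dbo.FullName(first_name, last_name) AS full_name
FROM person;
```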
They are an optimization barrier, which can be a good thing. I have used this to my advantage to stabilize tricky queries that were using views for code reuse. The performance becomes consistent and predictable, rather than going pathological on some databases, even though it may be slightly slower on others.
But I agree that implicit conversion is the root of a lot of evil
Please elaborate.
The first time I experienced the LOB issue was in 2009. At that time I didn't know how to really reproduce the issue, due to my lack of knowledge (which also triggered a desire to deep-dive into SQL Server internals). Not my article, nor his, but more generally: Paul White (SQL Kiwi) has published a load of great articles on internals; maybe some of the best technical articles I have read.
I'm not familiar with MS SQL (I've only worked with MySQL / PostgreSQL); can someone explain to me how it works?
If you are asking about the timezone shift applied to a date, I think the engine converts the date to a 00:00:00 timestamp and then does the timezone conversion.
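In SQL Server terms, that would look roughly like this (a sketch; the exact internal steps are an assumption):

```sql
-- A bare date has no time or offset, so it is first widened to a
-- midnight timestamp, and only then shifted between zones:
SELECT CAST(CAST('2019-06-01' AS date) AS datetime2)
       AT TIME ZONE 'UTC'
       AT TIME ZONE 'Central European Standard Time';
-- i.e. 2019-06-01 -> 2019-06-01 00:00:00 +00:00 -> 2019-06-01 02:00:00 +02:00
```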
SELECT
col1
,col2
,COUNT(col3)
FROM t1
JOIN t2 ON t1.pk = t2.fk
WHERE col1 = col2
AND col3 > col4
GROUP BY col1
,col2
HAVING COUNT(col3) > 1

This "fix" only helps with the case where new items are inserted at the end. By using alphabetic ordering, you increase the chances that a new item will be inserted at the beginning or the middle, in which case it makes no difference where you put the commas.
It only really helps at all because inserting new items at the end is common, whereas removing items from the start is rare, at least in SQL (all it really does is displace the dangling comma problem from the former case to the latter). However, always inserting new items at the end tends to lead to unintuitive ordering, which in turn leads to additional VCS churn when that becomes seen as technical debt.
Commas at the start does give objectively better readability in one sense: it ensures they are aligned vertically. That makes it easier to spot errors at a glance. You might call this "ease of formal reviewability".
However, in practice it seems to be worse for general readability, for the entirely subjective reason you already pointed to. Since code is typically read more often than it is written, it's important to optimise for that first.
Since an error here will always cause a hard fail, it falls into a different category than say, omitting braces in a C-style if statement, which can introduce subtle bugs through bad merges. In the latter case, ease of formal reviewability has to take precedence over subjective aesthetics.
WHERE foo
AND bar
Or with long, complex ones:

WHERE
foo
AND
bar

I did not learn SQL this way. I was shown these tricks, and their adoption was near-immediate BECAUSE it just made more sense to my brain.
This isn't to devalue/argue against your comment. I just find it interesting that the common denominator for OCD-ish compulsions is that the "right" way for an individual OCD brain is usually specific to that individual brain.
SELECT A.value, B.value
FROM tableA A
JOIN tableB B ON B.id = A.bId
WHERE A.id = 77
What abomination of a date format is this? I can only assume this is a bug, a typo, or an easter egg for those paying attention. Please let it be one of those. The last thing the world needs is people pushing yet another crazy date format into use.
[1] https://www.ibm.com/support/knowledgecenter/en/SSS28S_8.1.0/...
An ORM only works up to the point where you need to join tables. As soon as that's needed, the ORM just causes you endless trouble.