After 20+ years of SQL usage (as an ordinary dev, not business/reporting/heavy SQL dev), I learned about `group by cube` from this article...
"group by cube/coalesce" is much more complicated thing in this article than "having" (that could be explained as where but for group by)
Using GROUP BY CUBE(a, b, c, ...) creates GROUP BY expressions for every element in the power set of {a, b, c, ...}, so GROUP BY CUBE(a, b) does separate GROUP BYs for (a, b), (a), (b) and ().
It's like SQL's version of a pivot table, returning aggregations of data filtered along multiple dimensions, and then also the aggregations of those aggregations.
It seems like it's well supported by Postgres [1], SQL Server [2] and Oracle [3], but MySQL only has partial support for ROLLUP with a different syntax [4].
[1]: https://www.postgresql.org/docs/current/queries-table-expres...
[2]: https://docs.microsoft.com/en-us/sql/t-sql/queries/select-gr...
[3]: https://oracle-base.com/articles/misc/rollup-cube-grouping-f...
[4]: https://dev.mysql.com/doc/refman/8.0/en/group-by-modifiers.h...
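To make the power-set behavior concrete, here's a runnable sqlite3 sketch. SQLite has no CUBE, so the four grouping sets of CUBE(region, year) are emulated with UNION ALL; the table and data are invented for illustration:

```python
import sqlite3

# Emulating GROUP BY CUBE(region, year) in SQLite, which lacks CUBE:
# one GROUP BY per element of the power set {(region, year), (region), (year), ()},
# glued together with UNION ALL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, year INT, amount INT);
    INSERT INTO sales VALUES
        ('EU', 2023, 10), ('EU', 2024, 20),
        ('US', 2023, 30), ('US', 2024, 40);
""")
rows = conn.execute("""
    SELECT region, year, SUM(amount) AS total FROM sales GROUP BY region, year
    UNION ALL
    SELECT region, NULL, SUM(amount) FROM sales GROUP BY region
    UNION ALL
    SELECT NULL, year, SUM(amount) FROM sales GROUP BY year
    UNION ALL
    SELECT NULL, NULL, SUM(amount) FROM sales
""").fetchall()
# 4 detail rows + 2 per-region rows + 2 per-year rows + 1 grand total = 9 rows;
# NULL marks a "rolled up" dimension, just as real CUBE output does.
```

On engines with native support this whole thing collapses to `GROUP BY CUBE(region, year)`.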
Is there such a book?
Some also support the MySQL/MariaDB WITH ROLLUP syntax, introduced in 2018.
MS SQL Server has native pivot support, and Postgres has a limited emulation of it through the crosstab(...) function. https://stackoverflow.com/a/11751905 https://www.postgresql.org/docs/current/tablefunc.html
For folks just learning about ROLLUP et al, I highly recommend this comparison chart for an overview of major features offered by modern relational databases. https://www.sql-workbench.eu/dbms_comparison.html
There's a whole constellation of advanced features out there that arguably most application developers are largely unaware of. (Which explains why most app devs still treat relational databases like dumb bit buckets at the far end of their ORMs.)
If I had a fancy DB I could use CUBE or GROUPING SETS and MATERIALIZED VIEWs to easily pre-calculate statistics for every combination of filter parameters that automatically get updated when the source data changed. But I had MariaDB so I made do. I ended up with something like this:
SELECT ... SUM(ABS(r.ilength)) AS distance, COUNT(*) AS intervals
FROM r
GROUP BY average_retro_bucket, customer, `year`, lane_type,
         material_type, state, county, district WITH ROLLUP
HAVING average_retro_bucket IS NOT NULL AND customer IS NOT NULL;
"The WITH ROLLUP modifier adds extra rows to the resultset that represent super-aggregate summaries. The super-aggregated column is represented by a NULL value. Multiple aggregates over different columns will be added if there are multiple GROUP BY columns."So you can query like this to get stats for all districts in CA->Mendocino county:
SELECT * FROM stats_table WHERE state = 'CA' AND county = 'Mendocino' AND district IS NULL
or like this to get a single aggregate of all the counties in CA put together:

SELECT * FROM stats_table WHERE state = 'CA' AND county IS NULL AND district IS NULL
However, unlike CUBE, WITH ROLLUP doesn't create aggregate result sets for each combination of grouping columns. If one grouping column is a NULL aggregate, all the following ones are too. So if you want to query all the years put together but only in CA, you can't do:

SELECT * FROM stats_table WHERE year IS NULL AND state = 'CA'

If `year` is null, all the following columns are as well. The solution was to manually implement wildcards before the last filtered group column by combining the rows together in the backend.

I worked around not having materialized views by creating an EVENT that would re-create the stats tables every night. The stats don't really need to be real-time. Re-writing the multiple-GB statistics tables every night will wear out the SSDs in 20 years or so, oh well.
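The ROLLUP-vs-CUBE difference is easy to state in terms of grouping sets. A tiny sketch (plain Python, column names invented) of which GROUP BY combinations each modifier generates:

```python
from itertools import combinations

def rollup_sets(cols):
    # ROLLUP(a, b, c) produces prefixes only: (a, b, c), (a, b), (a), ()
    return [tuple(cols[:i]) for i in range(len(cols), -1, -1)]

def cube_sets(cols):
    # CUBE(a, b, c) produces every subset: the full power set, 2**n sets
    return [s for r in range(len(cols), -1, -1)
              for s in combinations(cols, r)]
```

Note that `('county',)` on its own never appears in `rollup_sets(['state', 'county', 'district'])`, which is exactly why a `year IS NULL AND state = 'CA'` style lookup can't work against a ROLLUP-built stats table.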
`group by cube`/`group by coalesce` aren't special advanced features, they're just `group by`. You can group on almost anything: maybe you want to group on a name regardless of case or extraneous whitespace, and you can use functions like `lower` and `trim` in the `group by` no problem. It's not something you have to learn separately for every function.
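For example, a sqlite3 sketch (table invented) grouping on an expression rather than a raw column, so that names differing only in case or whitespace collapse into one group:

```python
import sqlite3

# Grouping on lower(trim(name)) instead of name itself:
# 'Alice' and '  alice ' land in the same group.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (name TEXT, score INT);
    INSERT INTO people VALUES ('Alice', 1), ('  alice ', 2), ('Bob', 3);
""")
rows = conn.execute("""
    SELECT lower(trim(name)) AS canonical, SUM(score)
    FROM people
    GROUP BY lower(trim(name))
    ORDER BY canonical
""").fetchall()
```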
I’ve been a developer for 15 years, consider myself pretty good with SQL, but I’m only now learning about group by cube.
Caesar, if you see this: thanks for being a great mentor.
Their docs go into more detail (https://docs.snowflake.com/en/sql-reference/constructs/quali...), but the short version is that typically SELECT is evaluated in the order FROM, WHERE, GROUP BY, HAVING, WINDOW, DISTINCT, ORDER BY, LIMIT.
But what happens if you want to filter on the result of a WINDOW? Sorry, time to write a nested query and bemoan the non-composability of SQL.
Snowflake adds QUALIFY, which is executed after WINDOW and before DISTINCT. Therefore you can write some pretty interesting tidy queries, like this one I've been using at work:
SELECT
*,
row_number() OVER (
PARTITION BY quux.record_time
ORDER BY quux.projection_time DESC
) AS record_version
FROM foo.bar.quux AS quux
WHERE (NOT quux._bad OR quux._bad IS NULL)
QUALIFY record_version = 1
Without QUALIFY, you'd have to nest queries (ugh):

SELECT *
FROM (
SELECT
*,
row_number() OVER (
PARTITION BY quux.record_time
ORDER BY quux.projection_time DESC
) AS record_version
FROM foo.bar.quux AS quux
WHERE (NOT quux._bad OR quux._bad IS NULL)
) AS quux_versioned
WHERE quux_versioned.record_version = 1
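That nested-query workaround is runnable on SQLite too (3.25+ has window functions, but no QUALIFY). A sketch with an invented schema standing in for foo.bar.quux:

```python
import sqlite3

# No QUALIFY in SQLite: filter on the window function's result
# via a subquery, keeping only the latest projection per record_time.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE quux (record_time INT, projection_time INT, value TEXT);
    INSERT INTO quux VALUES
        (1, 10, 'old'), (1, 20, 'new'),
        (2, 10, 'only');
""")
rows = conn.execute("""
    SELECT record_time, value FROM (
        SELECT *,
               row_number() OVER (
                   PARTITION BY record_time
                   ORDER BY projection_time DESC
               ) AS record_version
        FROM quux
    )
    WHERE record_version = 1
    ORDER BY record_time
""").fetchall()
```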
or use a CTE (depending on whether your database inlines CTEs).

I definitely pine for some kind of optimizing Blub-to-SQL compiler that would let me write my SQL like this instead:
(query (from foo.bar.quux :as quux)
(select *)
(where (or (not quux._bad)
(null quux._bad))
(select (row-number :over (:partition-by quux.record-time
:order-by (:desc quux.projection-time))
:as record-version)
(where (= 1 record-version)))

I've been playing with malloy[1], which lets you pipeline/nest queries like you're describing here.
source: quux as from_sql(..) {
where: record_version = 1
where: _bad != null
}
1. https://looker-open-source.github.io/malloy/documentation/la...

SELECT year, COUNT(*) sales, SUM(price) income
FROM sales
HAVING sales > 10 AND income < 1000;
SELECT *
FROM (SELECT year, COUNT(*) as nb_sales, SUM(price) as income
FROM sales)
WHERE nb_sales > 10 AND income < 1000;
work just like your example?

SELECT year, sales, income
FROM (
    SELECT year, COUNT(*) sales, SUM(price) income
    FROM sales
) AS innerquery
WHERE sales > 10 AND income < 1000;
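For what it's worth, the HAVING form and the nested-query form do return the same groups. A sqlite3 sketch (table and data invented; the aggregate expressions are repeated in HAVING because whether HAVING accepts bare SELECT aliases varies by engine):

```python
import sqlite3

# Same filter expressed two ways: HAVING on the grouped query,
# versus WHERE on a subquery that did the grouping.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (year INT, price INT);
    INSERT INTO sales VALUES (2022, 600), (2022, 500), (2023, 50), (2023, 40);
""")
having = conn.execute("""
    SELECT year, COUNT(*) AS nb_sales, SUM(price) AS income
    FROM sales GROUP BY year
    HAVING COUNT(*) > 1 AND SUM(price) < 1000
""").fetchall()
nested = conn.execute("""
    SELECT * FROM (
        SELECT year, COUNT(*) AS nb_sales, SUM(price) AS income
        FROM sales GROUP BY year
    ) WHERE nb_sales > 1 AND income < 1000
""").fetchall()
# 2022 is excluded (income 1100), 2023 survives in both forms.
```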
Stage 1: From returns the whole world of rows.
Stage 2: Where filters down to the desired set of rows.
Stage 3: Group By aggregates the filtered rows.
Stage 4: Having filters again on the aggregated result.
Stage 5: Select picks out the columns.
(I know that I am probably missing some important technical points, I would like to learn about them)
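The stage order is observable: the same kind of filter placed at stage 2 (WHERE) versus stage 4 (HAVING) gives different results. A sqlite3 sketch with an invented table:

```python
import sqlite3

# WHERE runs before GROUP BY, so it changes what gets aggregated;
# HAVING runs after, so it filters the already-aggregated groups.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (grp TEXT, val INT);
    INSERT INTO t VALUES ('a', 5), ('a', 200), ('b', 5);
""")
# WHERE drops the 200-row before aggregation: both groups sum to 5.
where_sums = conn.execute(
    "SELECT grp, SUM(val) FROM t WHERE val < 100 GROUP BY grp ORDER BY grp"
).fetchall()
# HAVING aggregates everything first, then drops group 'a' (sum 205).
having_sums = conn.execute(
    "SELECT grp, SUM(val) FROM t GROUP BY grp HAVING SUM(val) < 100 ORDER BY grp"
).fetchall()
```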
WITH A AS (
SELECT x, sum(y) AS z
FROM SomeTable
GROUP BY x
)
SELECT * FROM A WHERE z > 10
With a HAVING clause you can instead just tuck it after the GROUP BY clause.

Also, although it's not an issue these days given how good query planners are (any decent engine will produce exactly the same query plan for a subquery or a HAVING clause; it's indexes that fuck up stuff), you're signaling that the filter happens "at the end".
It's like having both "while" and "for" in a programming language. Technically you don't need it, but it's for humans and not for compilers.
select product
, sum(price) as price
from table
where price<1000
having price>10000
You can refer to the aliased price column before or after aggregation using WHERE or HAVING, depending on the SQL engine.

Find me the list of non-deleted users who have more than 50 dollars worth of transactions in the transaction table.
Technically you can always subquery and use a WHERE instead of a HAVING, but it's nice to ... have.
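The interview question above is a nice WHERE-plus-HAVING combo. A sqlite3 sketch with an invented schema (users/transactions tables and column names are assumptions):

```python
import sqlite3

# Non-deleted users with more than $50 of transactions:
# WHERE filters rows before grouping, HAVING filters groups after.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INT, name TEXT, deleted INT);
    CREATE TABLE transactions (user_id INT, amount REAL);
    INSERT INTO users VALUES (1, 'ann', 0), (2, 'bob', 0), (3, 'cat', 1);
    INSERT INTO transactions VALUES (1, 30), (1, 25), (2, 10), (3, 99);
""")
rows = conn.execute("""
    SELECT u.name, SUM(t.amount) AS total
    FROM users u
    JOIN transactions t ON t.user_id = u.id
    WHERE u.deleted = 0          -- row filter, before grouping
    GROUP BY u.id, u.name
    HAVING SUM(t.amount) > 50    -- group filter, after aggregation
""").fetchall()
# 'bob' fails the HAVING, 'cat' fails the WHERE; only 'ann' survives.
```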
It would be nice if we could have WHERE statements anywhere. I would like to put WHERE before joins for example. The optimizer can figure out that the filter should happen before the join but it is nice to have that clear when looking at the code and it often makes it easier to read the query because you have to hold less state in your head. You can quickly be sure that this query only deals with "users WHERE deleted" before worrying about what data gets joined in.
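A CTE gets most of the way there today: the filter is stated up front, before any join logic. A sqlite3 sketch (schema invented):

```python
import sqlite3

# State "users WHERE not deleted" first, then join against it,
# instead of burying the filter in the final WHERE clause.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INT, deleted INT);
    CREATE TABLE orders (user_id INT, total INT);
    INSERT INTO users VALUES (1, 0), (2, 1);
    INSERT INTO orders VALUES (1, 10), (2, 20);
""")
rows = conn.execute("""
    WITH live_users AS (
        SELECT * FROM users WHERE deleted = 0
    )
    SELECT u.id, o.total
    FROM live_users u
    JOIN orders o ON o.user_id = u.id
""").fetchall()
# deleted user 2's order never enters the join.
```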
e.g. rather than having to do
SELECT
COALESCE(extract_district(rm.district), extract_district(p.project_name), NULLIF(rm.district, '')) AS district,
...
FROM ...
WHERE COALESCE(extract_district(rm.district), extract_district(p.project_name), NULLIF(rm.district, '')) IS NOT NULL
just do SELECT
COALESCE(extract_district(rm.district), extract_district(p.project_name), NULLIF(rm.district, '')) AS district,
...
FROM ...
HAVING district IS NOT NULL
Hopefully the optimizer understands that these are equivalent; I haven't checked.

Select pk columns + count(column that might have dups)
From table
Group by pk
Having count(*) > 1
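That duplicate-finding pattern, runnable as a sqlite3 sketch (table invented): group on the would-be key and keep only groups with more than one row.

```python
import sqlite3

# Classic dup check: GROUP BY the supposed key, HAVING COUNT(*) > 1.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (pk INT, payload TEXT);
    INSERT INTO t VALUES (1, 'x'), (2, 'y'), (2, 'z'), (3, 'w');
""")
dups = conn.execute("""
    SELECT pk, COUNT(*) AS n
    FROM t
    GROUP BY pk
    HAVING COUNT(*) > 1
""").fetchall()
# only pk=2 appears twice.
```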
Although perhaps that shouldn't annoy me anyway; it ought to be better than being specific but inaccurate (which you could probably expect from someone unfamiliar enough to say "a Docker"), like mixing up image/container as people do.
In my hiring experience it always acted as my first SQL impression of candidates, and a 100% reliable filter for people who don't actually know SQL beyond some basic inner joins and aggregations. There are always exceptions, I guess. :-)
It seems like the author is using one query per country, where it seems like you'd just group by country and year, with WHERE country <> 'US'.
You would need some unions to bolt on the additional aggregations, but it's more like 4 queries, not 14.
Eg
select c.ctry_name, i.year_nbr,
       sum(i.item_cnt) as tot_cnt, sum(i.invoice_amt) as tot_amt
from country c
inner join invoice i on (i.ctry_code = c.ctry_code)
where c.ctry_name <> 'USA'
group by c.ctry_name, i.year_nbr
Yeah, four queries for the four requirements seems the most straightforward, and easier to maintain than the final version.
But still a very nice illustration of group by cube.
It's tedious to construct an example database of, say, fifth normal form, which would show the actual utility of this kind of technique. So we're left with a highly detailed query, with some redundancies which wouldn't be redundant with more tables.
Clearly not since they got it down to a few seconds ;).
But tongue in cheek aside, it's incredibly dependent on what is in those 4M rows, how big they are, if they are indexed, whether you join in weird ways, whether the query planner does something unexpected.
SQL tooling is by and large pretty barbaric compared to most other tools these days. IntelliJ (and its various language-specific spin-off IDEs) has the best tooling I've seen in that area, but even that is primitive.
For one thing, in SQLite, it doesn't, which is a weak argument for not using it on supported systems. The other weak argument is that a RIGHT JOIN is just the (b, a) version of a LEFT JOIN (a, b).
When you add them up, it's an extra concept; SQL execution flow is already somewhat unintuitive, and a policy of using only one of the two ways of saying "everything from a and matches from b" makes for a more consistent codebase.
I would hope a blue-sky relational query language wouldn't support two syntaxes for an operation which is non-commutative; when order is important it can and should be indicated by order.
I also dislike code that does not have all lines included. The UNION clauses here are missing, which I simply find irritating.