After 20+ years of SQL usage (as an ordinary dev, not business/reporting/heavy SQL dev), I learned about `group by cube` from this article...
"group by cube/coalesce" is much more complicated thing in this article than "having" (that could be explained as where but for group by)
Using GROUP BY CUBE(a, b, c, ...) creates GROUP BY expressions for every element in the power set of {a, b, c, ...}, so GROUP BY CUBE(a, b) does separate GROUP BYs for (a, b), (a), (b) and ().
It's like SQL's version of a pivot table, returning aggregations of data filtered along multiple dimensions, and then also the aggregations of those aggregations.
It seems like it's well supported by Postgres [1], SQL Server [2] and Oracle [3], but MySQL only has partial support for ROLLUP with a different syntax [4].
[1]: https://www.postgresql.org/docs/current/queries-table-expres...
[2]: https://docs.microsoft.com/en-us/sql/t-sql/queries/select-gr...
[3]: https://oracle-base.com/articles/misc/rollup-cube-grouping-f...
[4]: https://dev.mysql.com/doc/refman/8.0/en/group-by-modifiers.h...
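To make the power-set behavior concrete, here's a runnable sqlite3 sketch. SQLite has no CUBE, so the four grouping sets of CUBE(region, year) are emulated with UNION ALL; the table and data are invented for illustration:

```python
import sqlite3

# Emulating GROUP BY CUBE(region, year) in SQLite, which lacks CUBE:
# one GROUP BY per element of the power set {(region, year), (region), (year), ()},
# glued together with UNION ALL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, year INT, amount INT);
    INSERT INTO sales VALUES
        ('EU', 2023, 10), ('EU', 2024, 20),
        ('US', 2023, 30), ('US', 2024, 40);
""")
rows = conn.execute("""
    SELECT region, year, SUM(amount) AS total FROM sales GROUP BY region, year
    UNION ALL
    SELECT region, NULL, SUM(amount) FROM sales GROUP BY region
    UNION ALL
    SELECT NULL, year, SUM(amount) FROM sales GROUP BY year
    UNION ALL
    SELECT NULL, NULL, SUM(amount) FROM sales
""").fetchall()
# 4 detail rows + 2 per-region rows + 2 per-year rows + 1 grand total = 9 rows;
# NULL marks a "rolled up" dimension, just as real CUBE output does.
```

On engines with native support this whole thing collapses to `GROUP BY CUBE(region, year)`.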
Is there such a book?
Some also support the MySQL/MariaDB WITH ROLLUP syntax, introduced in 2018.
MS SQL Server has native pivot support, and Postgres has a limited emulation of it through the crosstab(...) function. https://stackoverflow.com/a/11751905 https://www.postgresql.org/docs/current/tablefunc.html
For folks just learning about ROLLUP et al, I highly recommend this comparison chart for an overview of major features offered by modern relational databases. https://www.sql-workbench.eu/dbms_comparison.html
There's a whole constellation of advanced features out there that arguably most application developers are largely unaware of. (Which explains why most app devs still treat relational databases like dumb bit buckets at the far end of their ORMs.)
If I had a fancy DB I could use CUBE or GROUPING SETS and MATERIALIZED VIEWs to easily pre-calculate statistics for every combination of filter parameters that automatically get updated when the source data changed. But I had MariaDB so I made do. I ended up with something like this:
SELECT ... SUM(ABS(r.ilength)) AS distance, COUNT(*) AS intervals
FROM r
GROUP BY average_retro_bucket, customer, `year`, lane_type,
         material_type, state, county, district WITH ROLLUP
HAVING average_retro_bucket IS NOT NULL AND customer IS NOT NULL;
"The WITH ROLLUP modifier adds extra rows to the resultset that represent super-aggregate summaries. The super-aggregated column is represented by a NULL value. Multiple aggregates over different columns will be added if there are multiple GROUP BY columns."So you can query like this to get stats for all districts in CA->Mendocino county:
SELECT * FROM stats_table WHERE state = 'CA' AND county = 'Mendocino' AND district IS NULL
or like this to get a single aggregate of all the counties in CA put together:

SELECT * FROM stats_table WHERE state = 'CA' AND county IS NULL AND district IS NULL
However, unlike CUBE, WITH ROLLUP doesn't create aggregate result sets for each combination of grouping columns. If one grouping column is a NULL aggregate, all the following ones are too. So if you want to query all the years put together but only in CA, you can't do:

SELECT * FROM stats_table WHERE year IS NULL AND state = 'CA'

If `year` is null, all the following columns are as well. The solution was to manually implement wildcards before the last filtered group column by combining the rows together in the backend.

I worked around not having materialized views by creating an EVENT that would re-create the stats tables every night. The stats don't really need to be real-time. Re-writing the multiple-GB statistics tables every night will wear out the SSDs in 20 years or so, oh well.
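The ROLLUP-vs-CUBE difference is easy to state in terms of grouping sets. A tiny sketch (plain Python, column names invented) of which GROUP BY combinations each modifier generates:

```python
from itertools import combinations

def rollup_sets(cols):
    # ROLLUP(a, b, c) produces prefixes only: (a, b, c), (a, b), (a), ()
    return [tuple(cols[:i]) for i in range(len(cols), -1, -1)]

def cube_sets(cols):
    # CUBE(a, b, c) produces every subset: the full power set, 2**n sets
    return [s for r in range(len(cols), -1, -1)
              for s in combinations(cols, r)]
```

Note that `('county',)` on its own never appears in `rollup_sets(['state', 'county', 'district'])`, which is exactly why a `year IS NULL AND state = 'CA'` style lookup can't work against a ROLLUP-built stats table.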
`group by cube`/`group by coalesce` aren't special advanced features, they're just `group by`. You can group on almost anything: maybe you want to group on a name regardless of case or extraneous whitespace, and you can use functions like `lower` and `trim` in the `group by` no problem. It's not something you have to learn separately for every function.
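For example, a sqlite3 sketch (table invented) grouping on an expression rather than a raw column, so that names differing only in case or whitespace collapse into one group:

```python
import sqlite3

# Grouping on lower(trim(name)) instead of name itself:
# 'Alice' and '  alice ' land in the same group.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (name TEXT, score INT);
    INSERT INTO people VALUES ('Alice', 1), ('  alice ', 2), ('Bob', 3);
""")
rows = conn.execute("""
    SELECT lower(trim(name)) AS canonical, SUM(score)
    FROM people
    GROUP BY lower(trim(name))
    ORDER BY canonical
""").fetchall()
```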
I’ve been a developer for 15 years, consider myself pretty good with SQL, but I’m only now learning about group by cube.
Caesar, if you see this: thanks for being a great mentor.
Their docs go into more detail (https://docs.snowflake.com/en/sql-reference/constructs/quali...), but the short version is that typically SELECT is evaluated in the order FROM, WHERE, GROUP BY, HAVING, WINDOW, DISTINCT, ORDER BY, LIMIT.
But what happens if you want to filter on the result of a WINDOW? Sorry, time to write a nested query and bemoan the non-composability of SQL.
Snowflake adds QUALIFY, which is executed after WINDOW and before DISTINCT. Therefore you can write some pretty interesting tidy queries, like this one I've been using at work:
SELECT
*,
row_number() OVER (
PARTITION BY quux.record_time
ORDER BY quux.projection_time DESC
) AS record_version
FROM foo.bar.quux AS quux
WHERE (NOT quux._bad OR quux._bad IS NULL)
QUALIFY record_version = 1
Without QUALIFY, you'd have to nest queries (ugh):

SELECT *
FROM (
SELECT
*,
row_number() OVER (
PARTITION BY quux.record_time
ORDER BY quux.projection_time DESC
) AS record_version
FROM foo.bar.quux AS quux
WHERE (NOT quux._bad OR quux._bad IS NULL)
) AS quux_versioned
WHERE quux_versioned.record_version = 1
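That nested-query workaround is runnable on SQLite too (3.25+ has window functions, but no QUALIFY). A sketch with an invented schema standing in for foo.bar.quux:

```python
import sqlite3

# No QUALIFY in SQLite: filter on the window function's result
# via a subquery, keeping only the latest projection per record_time.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE quux (record_time INT, projection_time INT, value TEXT);
    INSERT INTO quux VALUES
        (1, 10, 'old'), (1, 20, 'new'),
        (2, 10, 'only');
""")
rows = conn.execute("""
    SELECT record_time, value FROM (
        SELECT *,
               row_number() OVER (
                   PARTITION BY record_time
                   ORDER BY projection_time DESC
               ) AS record_version
        FROM quux
    )
    WHERE record_version = 1
    ORDER BY record_time
""").fetchall()
```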
or use a CTE (depending on whether your database inlines CTEs).

I definitely pine for some kind of optimizing Blub-to-SQL compiler that would let me write my SQL like this instead:
(query (from foo.bar.quux :as quux)
(select *)
(where (or (not quux._bad)
(null quux._bad))
(select (row-number :over (:partition-by quux.record-time
:order-by (:desc quux.projection-time))
:as record-version)
(where (= 1 record-version)))

I've been playing with malloy[1], which lets you pipeline/nest queries like you're describing here.
source: quux as from_sql(..) {
where: record_version = 1
where: _bad != null
}
1. https://looker-open-source.github.io/malloy/documentation/la...

SELECT year, COUNT(*) sales, SUM(price) income
FROM sales
HAVING sales > 10 AND income < 1000;
SELECT *
FROM (SELECT year, COUNT(*) as nb_sales, SUM(price) as income
FROM sales)
WHERE nb_sales > 10 AND income < 1000;
work just like your example?

SELECT year, sales, income
FROM (
    SELECT year, COUNT(*) sales, SUM(price) income
    FROM sales
) AS innerquery
WHERE sales > 10 AND income < 1000;
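For what it's worth, the HAVING form and the nested-query form do return the same groups. A sqlite3 sketch (table and data invented; the aggregate expressions are repeated in HAVING because whether HAVING accepts bare SELECT aliases varies by engine):

```python
import sqlite3

# Same filter expressed two ways: HAVING on the grouped query,
# versus WHERE on a subquery that did the grouping.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (year INT, price INT);
    INSERT INTO sales VALUES (2022, 600), (2022, 500), (2023, 50), (2023, 40);
""")
having = conn.execute("""
    SELECT year, COUNT(*) AS nb_sales, SUM(price) AS income
    FROM sales GROUP BY year
    HAVING COUNT(*) > 1 AND SUM(price) < 1000
""").fetchall()
nested = conn.execute("""
    SELECT * FROM (
        SELECT year, COUNT(*) AS nb_sales, SUM(price) AS income
        FROM sales GROUP BY year
    ) WHERE nb_sales > 1 AND income < 1000
""").fetchall()
# 2022 is excluded (income 1100), 2023 survives in both forms.
```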
Stage 1: From returns the whole world of rows.
Stage 2: Where filters down to the desired set of rows.
Stage 3: Group By aggregates the filtered rows.
Stage 4: Having filters again on the aggregated result.
Stage 5: Select picks out the columns.
(I know that I am probably missing some important technical points, I would like to learn about them)
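The stage order is observable: the same kind of filter placed at stage 2 (WHERE) versus stage 4 (HAVING) gives different results. A sqlite3 sketch with an invented table:

```python
import sqlite3

# WHERE runs before GROUP BY, so it changes what gets aggregated;
# HAVING runs after, so it filters the already-aggregated groups.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (grp TEXT, val INT);
    INSERT INTO t VALUES ('a', 5), ('a', 200), ('b', 5);
""")
# WHERE drops the 200-row before aggregation: both groups sum to 5.
where_sums = conn.execute(
    "SELECT grp, SUM(val) FROM t WHERE val < 100 GROUP BY grp ORDER BY grp"
).fetchall()
# HAVING aggregates everything first, then drops group 'a' (sum 205).
having_sums = conn.execute(
    "SELECT grp, SUM(val) FROM t GROUP BY grp HAVING SUM(val) < 100 ORDER BY grp"
).fetchall()
```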
WITH A AS (
SELECT x, sum(y) AS z
FROM SomeTable
GROUP BY x
)
SELECT * FROM A WHERE z > 10
With a HAVING clause you can instead just tuck it after the GROUP BY clause.

Also, although it's not an issue these days given how good query planners are (any decent engine will produce exactly the same query plan for a subquery or a HAVING clause; it's indexes that fuck up stuff), you're signaling that the filter happens "at the end".
It's like having both "while" and "for" in a programming language. Technically you don't need it, but it's for humans and not for compilers.
select product
, sum(price) as price
from table
where price<1000
having price>10000
You can refer to the aliased price column before or after aggregation using WHERE or HAVING, depending on the SQL engine.

Find me the list of non-deleted users who have more than 50 dollars worth of transactions in the transaction table.
Technically you can always subquery and use a WHERE instead of a HAVING, but it's nice to ... have.
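The interview question above is a nice WHERE-plus-HAVING combo. A sqlite3 sketch with an invented schema (users/transactions tables and column names are assumptions):

```python
import sqlite3

# Non-deleted users with more than $50 of transactions:
# WHERE filters rows before grouping, HAVING filters groups after.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INT, name TEXT, deleted INT);
    CREATE TABLE transactions (user_id INT, amount REAL);
    INSERT INTO users VALUES (1, 'ann', 0), (2, 'bob', 0), (3, 'cat', 1);
    INSERT INTO transactions VALUES (1, 30), (1, 25), (2, 10), (3, 99);
""")
rows = conn.execute("""
    SELECT u.name, SUM(t.amount) AS total
    FROM users u
    JOIN transactions t ON t.user_id = u.id
    WHERE u.deleted = 0          -- row filter, before grouping
    GROUP BY u.id, u.name
    HAVING SUM(t.amount) > 50    -- group filter, after aggregation
""").fetchall()
# 'bob' fails the HAVING, 'cat' fails the WHERE; only 'ann' survives.
```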
It would be nice if we could have WHERE statements anywhere. I would like to put WHERE before joins for example. The optimizer can figure out that the filter should happen before the join but it is nice to have that clear when looking at the code and it often makes it easier to read the query because you have to hold less state in your head. You can quickly be sure that this query only deals with "users WHERE deleted" before worrying about what data gets joined in.
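A CTE gets most of the way there today: the filter is stated up front, before any join logic. A sqlite3 sketch (schema invented):

```python
import sqlite3

# State "users WHERE not deleted" first, then join against it,
# instead of burying the filter in the final WHERE clause.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INT, deleted INT);
    CREATE TABLE orders (user_id INT, total INT);
    INSERT INTO users VALUES (1, 0), (2, 1);
    INSERT INTO orders VALUES (1, 10), (2, 20);
""")
rows = conn.execute("""
    WITH live_users AS (
        SELECT * FROM users WHERE deleted = 0
    )
    SELECT u.id, o.total
    FROM live_users u
    JOIN orders o ON o.user_id = u.id
""").fetchall()
# deleted user 2's order never enters the join.
```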
e.g. rather than having to do
SELECT
COALESCE(extract_district(rm.district), extract_district(p.project_name), NULLIF(rm.district, '')) AS district,
...
FROM ...
WHERE COALESCE(extract_district(rm.district), extract_district(p.project_name), NULLIF(rm.district, '')) IS NOT NULL
just do SELECT
COALESCE(extract_district(rm.district), extract_district(p.project_name), NULLIF(rm.district, '')) AS district,
...
FROM ...
HAVING district IS NOT NULL
Hopefully the optimizer understands that these are equivalent; I haven't checked.

Select pk columns + count(column that might have dups)
From table
Group by pk
Having count(*) > 1
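That duplicate-finding pattern, runnable as a sqlite3 sketch (table invented): group on the would-be key and keep only groups with more than one row.

```python
import sqlite3

# Classic dup check: GROUP BY the supposed key, HAVING COUNT(*) > 1.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (pk INT, payload TEXT);
    INSERT INTO t VALUES (1, 'x'), (2, 'y'), (2, 'z'), (3, 'w');
""")
dups = conn.execute("""
    SELECT pk, COUNT(*) AS n
    FROM t
    GROUP BY pk
    HAVING COUNT(*) > 1
""").fetchall()
# only pk=2 appears twice.
```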
Although perhaps that shouldn't annoy me anyway; it ought to be better than being specific but inaccurate (which you could probably expect from someone unfamiliar enough to say "a Docker"), like mixing up image/container as people do.
In my hiring experience it always acted as my first SQL impression of candidates, and a 100% reliable filter for people who don't actually know SQL beyond some basic inner joins and aggregations. There are always exceptions, I guess. :-)
It seems like the author is using one query per country, where it seems like you'd just group by country and year, with WHERE country <> 'US'.
You would need some unions to bolt on the additional aggregations, but it's more like 4 queries, not 14.
Eg
select c.ctry_name, i.year_nbr,
       sum(i.item_cnt) as tot_cnt, sum(i.invoice_amt) as tot_amt
from country c
inner join invoice i on (i.ctry_code = c.ctry_code)
where c.ctry_name <> 'USA'
group by c.ctry_name, i.year_nbr
Yeah, four queries for the four requirements seems the most straightforward, and easier to maintain than the final version.
But still a very nice illustration of group by cube.
It's tedious to construct an example database of, say, fifth normal form, which would show the actual utility of this kind of technique. So we're left with a highly detailed query, with some redundancies which wouldn't be redundant with more tables.
Clearly not since they got it down to a few seconds ;).
But tongue in cheek aside, it's incredibly dependent on what is in those 4M rows, how big they are, if they are indexed, whether you join in weird ways, whether the query planner does something unexpected.
SQL tooling is by and large pretty barbaric compared to most other tools these days. IntelliJ (and its various language-specific spin-off IDEs) has the best tooling I've seen in that area, but even that is primitive.
For one thing, in SQLite, it doesn't, which is a weak argument for not using it on supported systems. The other weak argument is that a RIGHT JOIN is just the (b, a) version of a LEFT JOIN (a, b).
When you add them up, it's an extra concept; SQL execution flow is already somewhat unintuitive, and a policy of using only one of the two ways of saying "everything from a and matches from b" makes for a more consistent codebase.
I would hope a blue-sky relational query language wouldn't support two syntaxes for an operation which is non-commutative; when order is important it can and should be indicated by order.
I also dislike code that does not have all lines included. The UNION clauses here are missing, which I simply find irritating.