https://medium.com/airbnb-engineering/how-airbnb-achieved-me...
(As the article notes, Transform's founders are Airbnb vets who worked on Minerva)
SQL doesn't really have a way to define new aggregations well. So taking something like Daily Active Users / Monthly Active Users (DAU/MAU), which is a semi complicated calculation using windowed data, there's no good way to have a repeatable DAU/MAU aggregation concept you could reuse at different levels without having to re-write that sql that defines the aggregation. EG DAU/MAU against the account, or DAU/MAU by user type, or DAU/MAU by region would all be different queries/reports.
Instead if you have a layer that can be given the definitions of how a particular aggregation is structured, and it knows how data is related to each other, you can ask for that metric against different concepts or levels without having to repeat yourself.
Extending that out, if BI tools can understand that layer, then you can get consistent WYSIWYG report building across multiple tools without having to redefine the calculations everywhere.
I don't think I did a great job explaining here, this keynote is a good topic survey IMO: https://www.youtube.com/watch?v=w2weqVjbUTs&feature=emb_imp_...
For example, if you want to be able to aggregate revenue by subscription plan and by country over year, month or day, then you would have 6+ different aggregations. The more dimensions and time periods you add, the more the various combinations blow out. (I think the formula is dimension permutations multiplied by the number of time periods.) Having a single definition of the metrics prevents errors from occurring in rote implementations of the metrics. A bit like generics, but for queries.
At this point, I'm curious how they manage to make the integrations with different platforms work. (Ultimately, I think that's why they acquired Transform.)
Anyhoo, all pure speculation. While I've never used it, my friends have great things to say about DBT.
There you define the relationships between tables, which dimensions to expose to your users, and how measures are to be aggregated along which dimensions and at what roll ups. Filters, formatting, and other semantic information rounds out what’s possible, making it easier to build complex reporting without having to pre-build large joined tables in the warehouse.
In theory a semantic layer decoupled from BI can be useful too, as downstream systems can all query a metric. Maybe an operational workflow needs access to revenue numbers or a spreadsheet wants a feed of some fresh aggregates.
The real question is will BI tools adopt a semantic layer they don’t own, giving up control over a key part of their input into their visualizations? Time will tell.
- Dependency tracking
- Not just views, but tables, and incremental tables
- Testing framework
I've worked on a few data teams that have all been "copy/paste SQL from GitHub into the Query Editor" and it's (obviously) pretty bad. DBT is super low-lift and professionalizes your data pipeline basically instantly.
- Dev/stag/prod env check numbers before pushing to production.
- Unions between two sources that are not the same shape can be done without the headache. https://github.com/dbt-labs/dbt-utils#union_relations-source
- Macros for common case when statements.
dbt is the V1. You get a lot of tooling, including a proper dag, logging, parametrization. You also get the ability to easily materialize your tables in a convenient format, which is important if (probably when) you figure out consistency is important. Views can take you far, but most orgs will eventually need more, and dbt is designed to be exactly that.
As a side note, moving from views to dbt is actually quite easy. I've done it several times and it's usually taken a couple of developer days to get started and maybe a couple weeks to fully transition.