Data teams I talk to can't turn to any single location to see every touchpoint their data goes through. They're relying on each tool's independent scheduling system and hoping that everything runs at the right time without errors. If something breaks, bad data gets deployed and it becomes a mad scramble to verify which tool caused the error and which reports/dashboards/ML models/etc. were impacted downstream.
While these unbundled tools can get you 90% of the way to your desired end goal, you'll inevitably face a situation where your use case or SaaS tool is unsupported. In every situation like this I've ever faced, the team ultimately ends up writing and managing their own custom scripts to account for this situation. Now you have your unbundled tool + your custom script. Why not just manage all of the tools and your scripts from a singular source in the first place?
While unbundling is the reality, this new era of data technology will still need data orchestration tools that serve as a centralized view into your data workflows, whether that's Airflow or any of the new players in the space.
(Disclosure: I'm a co-founder of https://www.shipyardapp.com/, building better data orchestration for modern data teams)
No amount of tooling will make data transformation a painless process; all you end up doing is burying the business logic under so many layers of abstraction that it becomes impossible for anyone to understand.
1) Specialized tools reduce the amount of engineering overhead. As a business, I primarily care about time to value. If I can use specialized SaaS to get my data centralized, clean, and synced across my tools in a week, why would I want to spend months building all of these processes from scratch?
Sure, I lose control, visibility, and more... but I was able to deliver value 3 months ahead of schedule.
2) Existing tools like Airflow are highly technical to get started with. You can't just focus on building out scripted solutions. You have to set up and manage the infrastructure. You have to sift through the tool's documentation to understand how to effectively build DAGs. You have to intertwine your business logic with platform logic to make sure your code will run on Airflow.
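To make that concrete: even a trivial job ends up wrapped in Airflow's own constructs before any business logic runs. A rough sketch, with a made-up dag id and load_orders function standing in for the actual work:

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python import PythonOperator


    def load_orders():
        # the actual business logic -- everything around it is Airflow plumbing
        print("loading orders...")


    with DAG(
        dag_id="load_orders_daily",  # made-up example
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
        catchup=False,
    ) as dag:
        PythonOperator(task_id="load_orders", python_callable=load_orders)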
Because the demand for data professionals is high and the supply is low, the technology ends up trying to offset the need for those highly technical skills in your organization.
There are things I like and things I don't about it. The UI is awful -- I don't know anyone that likes it, unlike what the article states. I like that it's centralized and that it's all Python code.
Deploying it and fine-tuning the config for a variety of workloads can be a pain. Sometimes sensors don't work right. Tasks sometimes get evicted and killed for obscure reasons. Zombie tasks are a big enough pain that you'll see plenty of requests for help online.
That said, replacing it with a bunch of disparate tools again? Seems like a step backwards. And now instead of a single tool, your org has to vet, secure, understand and monitor a bunch of different tools? It's bad enough with only one...
What am I missing?
PS: data analysis/engineering as a field seems new and immature enough that, in my humble opinion, we should be focusing on developing good practices and theory, instead of deprecating existing (and pretty recent) tech at an ever increasing pace.
Airflow is... not amazing. But by the standards of horrible enterprise software we've all been subjected to, it's not that bad.
If you're complaining about Airflow, wait for the day you're forced to use an internally built database client.
That's Afghanistan.
Our proprietary AWS wrapper takes 45 damn minutes on a good day to allocate a VM. The AMI is built in two minutes. TWO.
I'm sure in 5 years Dagster and Prefect will have improved gradually in lots of incremental ways. For now Airflow is pretty solid.
Wait, maybe I explained myself badly: while I am complaining about some things I dislike about Airflow, at the same time I'm saying it's better than the random assortment of cron jobs we had before, and pushing back against the idea of "unbundling" it and going back to disparate tools by separate vendors.
I like writing Python code, I feel in control.
so... it's nothing more than processing plus a queue. I mean we already have RabbitMQ and TypeScript. We also already have TypeScript + Agenda (over Mongo).
We have gotten to the point where a single company is implementing queuing at least 4 different ways because "microservices".
I disagree with you, data engineering as a field has been around for a very long time. Good practices exist and are good enough to accommodate new ones, like MLOps and data versioning.
However, for every great DE setup, you can find at least ten others that are a complete pile of shit, featuring mission-critical scripted SQL reports that no one understands anymore and closed-source orchestration products with million-dollar support contracts that only one person has access to.
As always, tooling is rarely the issue. Data engineers are rarely working on the overall "big picture" and are often given tasks without context. Embedding data engineers with product and infrastructure teams is the solution to that issue.
It's too complex to run as a single team and there are far better tools out there for scheduling. Airflow only makes sense when you need complex logic surrounding when to run jobs, how to backfill, when to backfill, and complex dependency trees. Otherwise, you are much better off with something like AWS Step Functions.
We are, however, becoming more and more reliant on dbt, and the article makes a good point about Airflow providing no visibility into what's going on inside a dbt node. So we're ending up with an increasingly simple Airflow DAG, with most of the complexity hidden inside a single dbt node.
We use dbt to manage the DAG for the BQ transformations, put this in a container, and deploy it into the Kubernetes cluster that Airflow is running on as a single node.
Airflow can then handle the scheduling and DAG nodes for non-DWH dependencies such as loading/checking for files, kicking off tasks that need to run after the DWH refresh, and the like.
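If it helps, a rough sketch of what that single dbt node can look like (assuming the cncf.kubernetes provider; the import path moves around between provider versions, and the namespace, image, and arguments here are placeholders):

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
        KubernetesPodOperator,
    )

    with DAG("dwh_refresh", start_date=datetime(2021, 1, 1),
             schedule_interval="@daily", catchup=False) as dag:
        # One Airflow node = the whole dbt project; dbt resolves the
        # model-to-model dependencies itself when dbt run executes.
        dbt_run = KubernetesPodOperator(
            task_id="dbt_run",
            name="dbt-run",
            namespace="data",                       # placeholder
            image="gcr.io/our-project/dbt:latest",  # placeholder image with the dbt project baked in
            cmds=["dbt"],
            arguments=["run", "--profiles-dir", "/dbt"],
            get_logs=True,
        )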
I find once it is set up it is extremely easy for small teams to follow the pattern, and the single view of all the pipelines running is a great benefit - as well as handling the logic around last successful runs, etc., that would otherwise need to be implemented manually with simple cron jobs.
If you have simple needs that are more or less set, I agree Airflow is overkill and a simple Jenkins instance is all you need.
Really? Which ones? The only thing vaguely fitting this case is Jenkins, but using Jenkins to run ETL/ELT is a serious impedance mismatch.
But yes, I'm confused. Triggering a dag and having it exit based on complex logic is a perfectly normal pattern.
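Something like a ShortCircuitOperator guarding the rest of the DAG, to sketch the pattern (the condition here is obviously made up):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator, ShortCircuitOperator


    def should_run():
        # arbitrary complex logic; returning False skips everything downstream
        return datetime.utcnow().weekday() < 5  # e.g. weekdays only


    with DAG("guarded_dag", start_date=datetime(2021, 1, 1),
             schedule_interval="@hourly", catchup=False) as dag:
        gate = ShortCircuitOperator(task_id="gate", python_callable=should_run)
        work = PythonOperator(task_id="work", python_callable=lambda: print("doing work"))
        gate >> work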
The problem I've always had with Airflow has been with non-cron-like use cases, for example data pipelines kicked off when some event occurs. Sensors were often an awkward fit, and the HTTP API was quite immature back when I was using it.
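For what it's worth, the stable REST API in Airflow 2 made the event-driven case more reasonable: whatever emits the event can just create a dag run. A rough sketch, assuming the basic auth backend is enabled and with the host, credentials, and dag id made up:

    import requests

    resp = requests.post(
        "https://airflow.example.com/api/v1/dags/orders_pipeline/dagRuns",
        auth=("api_user", "api_password"),
        json={"conf": {"source_file": "s3://bucket/new_file.csv"}},
    )
    resp.raise_for_status()
    print(resp.json()["dag_run_id"])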
Would love to have you join our beta if you are interested!
Also, I can write my T in plain ol’ SQL (granted, with some jinja) instead of this dbt-QL that I can’t copy and paste into my database console or share with a non-dbt user.
So, folks who have adopted dbt: what am I missing by being a fuddy-duddy?
It sounds like, in your approach, this would mean writing the dependency logic into each DAG you schedule on Airflow.
In the same way you would interpolate your jinja SQL before copying it into the database, you would take the compiled SQL that dbt compile (or a dbt run) writes to the target/ folder and copy that into your DB console or share it.
EDIT: This means your T is a single airflow node in each DAG, though I then still use airflow for the E/L tasks around it
- Automatic DAG generation based on dbt-QL declared dependencies.
- The structure of where (db/schema) and how (table/view/temporary) things are built is defined in a YAML configuration, not the individual SQL statements.
- Testing/documentation baked in.
Sure, you can manage every select statement as its own task, but it becomes pretty infeasible once things scale.
dbt can still be administered alongside all other E and L-type tasks. It's just a Python CLI wrapped around SQL SELECT statements.
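To sketch what "alongside" looks like in practice, assuming dbt is installed wherever the Airflow worker runs and with made-up script paths:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG("elt", start_date=datetime(2021, 1, 1),
             schedule_interval="@daily", catchup=False) as dag:
        extract = BashOperator(task_id="extract", bash_command="python /opt/jobs/extract.py")
        load = BashOperator(task_id="load", bash_command="python /opt/jobs/load.py")
        # the whole dbt project is one T task next to the E and L tasks
        transform = BashOperator(
            task_id="dbt_run",
            bash_command="dbt run --project-dir /opt/dbt --profiles-dir /opt/dbt",
        )
        extract >> load >> transform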
As for the article, I don't think we are yet at the point where a competing stack composed of individual specialized components does things better, since Airflow is more than the sum of its parts imho.
The Circle of Computing Complexity.
Like mentioned in this thread, managing Airflow can quickly become complicated. Its flexibility means that you can stretch Airflow in pretty interesting ways. Especially when trying to pair container orchestrators like k8s with it.
To combat that complexity and reduce the operational burden of letting a data team create & deploy batch processing pipelines we created https://github.com/orchest/orchest
We suspect that many standardized use cases (like reverse ETL) will start disappearing from custom batch pipelines. But there’s a long tail of data processing tasks for which having freedom to invoke your language of choice has significant advantages. Not to mention stimulating innovative ideas (why not use Julia for one of your processing steps?).
Thanks.
FWIW, last I looked at Airflow I thought the schedule+task model could be made tighter, as there were numerous ways to end up in inconsistent states. For example, changing the schedule after tasks had already run would allow rerunning jobs (in the past) at dates that were never scheduled in the first place.
https://drive.google.com/file/d/1btZ0yck9SdgsUdNom0WXgHcSQvO...
My main gripes:
- The out-of-the-box configuration is not something you should use in production. It's basically using Python multiprocessing (yikes) and SQLite like you would on a developer machine. Instead, you'll be using dedicated workers running on different machines and either a database or Redis in between.
- Basically the problem is that Python is effectively single-threaded (the infamous GIL) and has synchronous IO. And that kind of sucks when you are building something that ought to be asynchronous and running on multiple threads, cores, CPUs, and machines. It's not a great language for that kind of job. Mostly in production it acts as a facade for stuff that is much better at such things (Kubernetes, YARN, etc.).
- Most of the documentation is intended for people doing stuff on their laptops, not for people trying to actually run this in a responsible way on actual servers. In our case that meant referring to third party git repositories with misc terraform, aws, etc. setup to figure out what configuration was needed to run it in a more responsible way.
- Python developers don't seem to grasp the notion that installing a lot of python dependencies on a production server is not a very desirable thing. Doing that sucks, to put it mildly. Virtual environments help. Either way, that complicates deployment of new dags to production. That severely limits what you should be packaging up as a dag and what you should be packaging up with e.g. docker.
- What that really means is that you should be considering packaging up most of your jobs using e.g. Docker. Airflow has a Docker runner and a Kubernetes runner (rough sketch of that pattern after this list). I found using them to be a bit buggy but we managed to patch our way around it.
- Speaking of docker, at the time there was no well supported dockerized setup for Airflow. We found multiple unsupported bits of configuration for kubernetes by third parties though. That stuff looked complicated. I quickly checked and at least they now provide a docker-compose for a setup with postgresql and redis; so that's an improvement.
- The UI was actually worse than Jenkins, and that's a bit dated to say the least. Very web 1.0. I found myself hitting F5 a lot to make it stop lying about the state of my dags. At least Jenkins had auto-reload. I assume somebody might have fixed that by now but the whole thing was pretty awful in terms of UX.
- Actual dag programming and testing was a PITA as well. And since it is python, you really do need to unit test dags before you deploy them and have them run against your production data. A small typo can really ruin your day.
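To the Docker point above: a rough sketch of that pattern (assuming the docker provider is installed; the image and command are made up):

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.docker.operators.docker import DockerOperator

    with DAG("dockerized_job", start_date=datetime(2021, 1, 1),
             schedule_interval="@daily", catchup=False) as dag:
        # the job's own dependencies live in the image,
        # not in the Airflow workers' Python environment
        DockerOperator(
            task_id="crunch_numbers",
            image="registry.example.com/jobs/crunch:1.4.2",  # placeholder
            command="python /app/crunch.py",                 # placeholder
            docker_url="unix://var/run/docker.sock",
        )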
We got it working in the end but it was a lot of work. I could have gotten our jobs running with jenkins in under a day easily.
I do find this part particularly annoying, since this project sits uncomfortably between library and appliance.
In an appliance, yeah sure you can pick and lock down whatever dependencies you want. But as a library you need to be lean and hyper flexible in what’s an acceptable dependency.
Airflow invites you to put a lot of logic into what runs in their venv, which may mean your project’s dependencies must include all of theirs. Being in that state is rather unfun.
Does anyone know if the community docker images for airflow can be run using podman?
On the first one, I suspect "probably", on the second "yes".