We've found that BuildKit has several inefficiencies preventing it from being as fast as it could be in the cloud, especially when dealing with simultaneous builds (common in CI). That led us to create our own optimized fork of BuildKit.
The number of fine-tuning knobs you can turn when running a self-hosted BuildKit instance is practically limitless, and I'd also encourage everyone to try it as a fantastic learning exercise.
Of course, CI SaaSes implement a lot of caching on their end, but they also try to put people on the most anemic machines possible to try and capture those juicy margins.
This unfortunately does not work for orgs that have, say, more than 20 engineers. The core issue is that once you have a test suite large enough to have ~30 shards, you only need one engineer `git push`ing once to saturate those 1-2 expensive machines you've got sitting in the office.
The CI workload is quite amenable to "serverless" when you get to a large enough org size, where most of the time you actually want to pay nothing (i.e. outside your business hours) but when your engineers are pushing code, you want 1500 vCPUs on-demand to run 4 or 5 test suites concurrently.
Seriously though, of course there's a lot of detail here, but I think people tend to not really internalize how much testing is about confidence, and things like incremental CI can really chew away at how big/small your test suite needs to be. There are some things that are just inherently slow, but I've seen a lot of test suites that spend most of their runtime rerunning tests that only exercise unchanged code.
My glib assertion is that there is likely to be no test suite generated by 20 engineers that requires 30 shards that is impossible to chop up with incremental CI. And downstream of that, getting incremental CI would improve DX a lot, cuz I bet those 30 shards take a long time
Obviously the dedicated machines are not rentable per hour, but the cloud is so much more expensive.
When you build alpine packages, you literally have to call abuild on your APKBUILD files. It's the same for Arch Linux. The files are called PKGBUILD. So even if you decide to package your applications (uh, using docker run? that changes nothing!) before docker build and then install them with the OS package manager, you will run into exactly the same thing.
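For reference, an APKBUILD is just a shell-syntax recipe that abuild consumes; here's a minimal sketch for a hypothetical `myapp` package (the field names and `build`/`package` functions are standard abuild conventions, but the package itself and its contents are made up):

```shell
# Hypothetical APKBUILD for an imaginary "myapp"; built with `abuild -r`
pkgname=myapp
pkgver=1.0.0
pkgrel=0
pkgdesc="Example application"
url="https://example.com/myapp"
arch="all"
license="MIT"
source="myapp-$pkgver.tar.gz"

build() {
    make
}

package() {
    # $pkgdir is the staging root abuild provides
    install -Dm755 myapp "$pkgdir"/usr/bin/myapp
}
```

The point stands: whether the recipe is an APKBUILD, a PKGBUILD, or a Dockerfile, the build step has to run somewhere, and you hit the same caching questions.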
We also currently have some jobs that build OCI images via the Docker/Podman CLI amd build using traditional Dockerfile/Containerfile scripts. For now those are centralized and run on just one host, on bare metal. I'd like to get those working via rootless Docker-in-Docker/Podman-in-Podman, but one thing that will be a little annoying with that is that we won't have any persistent caching at the Docker/Podman layer anymore. I suppose we'll end up using something like what's in the article to get that cache persistence back.
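One way to get that cache persistence back without local state is BuildKit's registry cache export, which works fine from ephemeral rootless DinD runners. A sketch, assuming a registry you control (the registry host and tags are placeholders; `--cache-to`/`--cache-from type=registry` are documented buildx options):

```shell
# Push the build cache to a registry so ephemeral runners can reuse it.
docker buildx build \
  --cache-to type=registry,ref=registry.example.com/myapp:buildcache,mode=max \
  --cache-from type=registry,ref=registry.example.com/myapp:buildcache \
  -t registry.example.com/myapp:latest \
  --push .
```

`mode=max` exports cache for all layers, including intermediate stages, at the cost of a larger cache artifact.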
That's a neat idea, was the primary motivation for building this out the perf gains on the table?
But as we started to mature our own CI 'infrastructure' (the automation we use to set up our self-hosted runners), I wanted to containerize the Nix builds. Using 'shell executors' in GitLab just feels icky to me, like a step backwards into Jenkins hell. Those jobs do leave a little bit more behind on disk. More importantly, though, while all of my team's Nix jobs use Nix in an ephemeral way, it is possible to run `nix profile install ...` in one of these bare metal jobs. That could potentially affect other such jobs, plus it creates a 'garbage collector root' that slightly reduces how much `nix-collect-garbage` can clean up. Our jobs are ones we'd like other teams across the company to run, and so we also want to provide some really low-effort ways for them to do so, namely: via shared infrastructure we host, via any Docker-capable runners they might already have, and by leveraging the same IaC we use to stand up our own runners.
To that end, we really want to have just one type of job that requires just one type of execution environment, and we definitely want opt-in persistence instead of a mess where jobs can very easily influence one another by accident or malice. But we don't want to lose the speedup! The real action in these jobs is small, so by sharing a persistent Nix store between runs, they go down from 2-10 minutes to 2-10 seconds, which is the kind of UX we want for our internal customers.
The new Nix image is more suitable for all three target scenarios: it's less risky on runner hosts shared by multiple teams, it still works normally (downloading deps via Nix on every run) on 'naive' Docker/Podman setups, and our runner initialization script actually uses Nix to provide Docker and Podman (both rootless), so any team can use it on top of whatever VM images they're already using for their CI runners regardless of distro or version once they're ready to opt into that performance optimization.
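The opt-in persistence described above can be as simple as mounting a named volume over the Nix store; a sketch using the stock `nixos/nix` image (the volume name and command are placeholders, not our actual setup):

```shell
# Opt-in persistent Nix store via a named volume. Teams that skip the
# -v flag just fall back to downloading dependencies on every run.
docker volume create nix-store   # one-time, per runner host
docker run --rm \
  -v nix-store:/nix \
  nixos/nix \
  nix-shell -p hello --run hello
```

Since the whole store lives under `/nix`, a single mount is enough to carry both the package cache and the database between runs.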
This should be protected with mTLS (https://docs.docker.com/build/drivers/remote/) or SSH (`endpoint: ssh://user@host`) to avoid cryptomining attacks and the like.
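Setting up the SSH variant with buildx might look like this (the host name is a placeholder; `docker buildx create` accepts `ssh://` endpoints, authenticating with your existing SSH keys):

```shell
# Point buildx at a remote build host over SSH instead of plain TCP.
docker buildx create \
  --name remote-builder \
  ssh://user@build-host
docker buildx build --builder remote-builder -t myapp:latest .
```

With SSH, you get encryption and authentication for free from your existing key infrastructure, instead of managing a separate CA for mTLS.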
I know it is out of style for some, but in my microservice architecture, which has a dozen services, each service takes about 90 seconds to build, maybe 2 minutes at most (if there is a slow Next.js build in there and a few thousand npm packages), and that is just on a 4-core GitHub Actions worker.
My microservices all build and deploy in parallel so this system doesn't get slower as you expand to more services.
(Open source template which shows how it works: https://github.com/bhouston/template-typescript-monorepo/act... )
If you're deploying all your "microservices" in parallel, then what you might have built is a distributed monolith.
A microservice can be tested and deployed independently.
Spinning up a build worker outright when a change is pushed is the fastest way, but it can be expensive if the build process is prolonged.
OTOH I've seen much faster image build times with smart reuse of layers, so that you don't have to re-run that huge npm install if your `package-lock.json` did not change.
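That layer reuse mostly comes down to copying the lockfile before the rest of the source, so the install layer is only invalidated when dependencies actually change. A sketch (base image and commands are illustrative):

```dockerfile
FROM node:20-alpine
WORKDIR /app
# Copy only the manifests first: this layer (and the npm ci below)
# stays cached until package-lock.json changes.
COPY package.json package-lock.json ./
RUN npm ci
# Source changes invalidate only the layers from here down.
COPY . .
RUN npm run build
```

The same pattern applies to any package manager with a lockfile: copy the lockfile, install, then copy the source.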
At Blacksmith we do see this pretty often! Rust services in particular are the most common offender.
As a side note: In my time running a CI infra co, we see that a majority of the workflow time for large teams comes from tests - which can have over 200 shards in some cases.
What’s the most common cause of builds taking this long in the first place…
Worst I have ever had was 5 minutes, but subsequent builds were reduced to under a minute thanks to the build cache, multi-stage builds, thin layers, and an optimized .dockerignore.
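An optimized .dockerignore alone can shrink the build context dramatically, which speeds up every build regardless of caching; a typical sketch (the entries are illustrative, tune them to your project):

```
# .dockerignore: keep the build context small
node_modules
.git
dist
*.log
.env
```

Excluding `node_modules` and `.git` matters most, since those are usually the largest directories and are never needed inside the build (dependencies get installed fresh in the image).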