- Configured a block-level in-memory disk accelerator / cache (fs operations at the speed of RAM!)
- Benchmarked EC2 instance types (m7a is the best x86 today, m8g is the best arm64)
- "Warming" the root EBS volume by accessing a set of priority blocks before the job starts to give the job full disk performance [0]
- Launching each runner instance in a public subnet with a public IP - the runner gets full throughput from AWS to the public internet, and IP-based rate limits rarely apply (Docker Hub)
- Configuring Docker with containerd/estargz support
- Just generally turning kernel options and unit files off that aren't needed
[0] https://docs.aws.amazon.com/ebs/latest/userguide/ebs-initial...
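The volume-warming step can be approximated with a plain sequential read. A minimal sketch, using a regular scratch file as a stand-in for the EBS block device (on a real runner you'd read the device itself, e.g. /dev/nvme0n1, and fio gives finer control over block size and queue depth):

```shell
# Stand-in for the EBS block device: a 4 MiB scratch file.
stand_in=$(mktemp)
dd if=/dev/zero of="$stand_in" bs=1M count=4 status=none

# "Warm" it: sequentially read every block. Against a real EBS
# volume this forces lazily-loaded blocks to be fetched from S3,
# so the job sees full disk performance later.
dd if="$stand_in" of=/dev/null bs=1M status=none && warmed=yes

rm -f "$stand_in"
echo "warmed=$warmed"
```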
Are you not using a caching registry mirror, instead pulling the same image from Hub for each runner...? If so that seems like it would be an easy win to add, unless you specifically do mostly hot/unique pulls.

The more efficient answer to those rate limits is almost always to pull fewer times for the same work rather than scaling in a way that circumvents them.
From a performance / efficiency perspective, we generally recommend using ECR Public images[0], since AWS hosts mirrors of all the "Docker official" images, and throughput to ECR Public is great from inside AWS.
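On the mirror suggestion upthread: pointing dockerd at a pull-through registry mirror is a one-key change in daemon.json. A sketch (writing to a scratch path here; the real file lives at /etc/docker/daemon.json and needs a dockerd restart, and mirror.gcr.io is just one public example):

```shell
# Write a daemon.json fragment enabling a registry mirror for
# Docker Hub pulls. Real path: /etc/docker/daemon.json.
conf=$(mktemp)
cat > "$conf" <<'EOF'
{
  "registry-mirrors": ["https://mirror.gcr.io"]
}
EOF
cat "$conf"
```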
I'm slightly old; is that the same thing as a ramdisk? https://en.wikipedia.org/wiki/RAM_drive
Every Linux kernel does that already. I currently have 20 GB of disk cached in RAM on this laptop.
If you corrupt a CI node, whatever. Just rerun the step.
edit: Or, even easier, just use the pre-built fail_function infrastructure (with retval = 0 instead of an error): https://docs.kernel.org/fault-injection/fault-injection.html
Actually, in my experience pulling very large images to run with Docker, it turns out that Docker doesn't really do any fsync-ing itself. The sync happens when it creates an overlayfs mount while creating a container, because the overlayfs driver in the kernel does it.
A volatile flag to the kernel driver was added a while back, but I don't think Docker uses it yet https://www.redhat.com/en/blog/container-volatile-overlay-mo...
Other options are to use an overlay mount with volatile or ext4 with nobarrier and writeback.
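For reference, the volatile overlay mount from the linked article looks like this. A dry-run sketch: the script only assembles and prints the mount command, since an actual mount needs root and a kernel with overlayfs volatile support (5.10+); the lower/upper/work paths are placeholders:

```shell
# Assemble an overlayfs mount command with the volatile option,
# which tells the overlay driver to skip syncing the upper layer.
lower=/var/lib/docker/overlay2/lower
upper=/var/lib/docker/overlay2/upper
work=/var/lib/docker/overlay2/work

cmd="mount -t overlay overlay -o lowerdir=$lower,upperdir=$upper,workdir=$work,volatile /mnt/merged"
echo "$cmd"   # dry run; execute as root to actually mount
```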
Why in the world does it do that ????
Ok I googled (kagi). Same reason anyone ever does: pure voodoo.
If you can't trust the kernel to close() then you can't trust it to fsync() or anything else either.
Kernel-level crashes, the only kind of crash that risks half-written files, are no more likely during dpkg than any other time. A bad update is the same bad update regardless, no better, no worse.
> If you can't trust the kernel to close() then you can't trust it to fsync() or anything else either.
https://man7.org/linux/man-pages/man2/close.2.html
A successful close does not guarantee that the data has been
successfully saved to disk, as the kernel uses the buffer cache to
defer writes. Typically, filesystems do not flush buffers when a
file is closed. If you need to be sure that the data is
physically stored on the underlying disk, use fsync(2). (It will
depend on the disk hardware at this point.)
So if you want to wait until it's been saved to disk, you have to do an fsync first. If you even just want to know whether it succeeded or failed, you have to do an fsync first.

Of course, none of this matters much on an ephemeral Github Actions VM. There's no "on next boot or whatever". So this is one environment where it makes sense to bypass all this careful durability work that I'd normally be totally behind. It seems reasonable enough to say the data has reached the page cache, it should continue being visible in the current boot, and tomorrow will never come.
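The man-page point above is visible from the shell, too: GNU coreutils `sync --data FILE` is roughly fdatasync(2) on that file, and that call is the moment a write error would actually surface. A sketch (assumes coreutils >= 8.24):

```shell
f=$(mktemp)
printf 'important bytes' > "$f"   # data may still be only in the page cache
sync --data "$f"                  # push it to disk; write errors surface here
echo "durable: $(cat "$f")"
```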
You can get half-written files in many other circumstances, eg on power outages, storage failures, hw caused crashes, dirty shutdowns, and filesystem corruption/bugs.
(Nitpick: trusting the kernel to close() doesn't have anything to do with this, like a sibling comment says)
and about kernel-level crashes: yes, but you see, dpkg creates a new file on the disk, makes sure it is written correctly with fsync(), and then calls rename() (or something like that) to atomically replace the old file with the new one.
So there is never a possibility of a given file being corrupt during an update.
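The write/fsync/rename dance described above, sketched in shell (dpkg does this in C; `sync --data` from GNU coreutils stands in for fsync(2), and the `.dpkg-new` suffix mirrors dpkg's convention):

```shell
target=$(mktemp)               # stands in for e.g. /usr/bin/foo
printf 'old contents' > "$target"

# 1. Write the new version to a temp file next to the target.
printf 'new contents' > "$target.dpkg-new"
# 2. fsync it so the bytes are on disk before the swap.
sync --data "$target.dpkg-new"
# 3. Atomic rename: readers see either old or new, never half-written.
mv "$target.dpkg-new" "$target"

cat "$target"
```

Because `mv` within one filesystem is rename(2), a crash at any point leaves either the intact old file or the intact, fully-synced new one.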
Maybe they know something you don't ?????
Imagine this scenario; you're writing a CI pipeline:
1. You write some script to `apt-get install` blah blah
2. As soon as the script is done, your CI job finishes.
3. Your job is finished, so the VM is powered off.
4. The hypervisor hits the power button but, oops, the VM still had dirty disk cache/pending writes.
The hypervisor may immediately pull the power (chaos monkey style; developers don't have patience), in which case those writes are now corrupted. Or it may use an ACPI shutdown, which should also have an ultimate timeout before pulling power (otherwise stalled IO might prevent resources from ever being cleaned up).
If you rely on the sync happening at step 4, while the kernel gracefully exits, how long does the kernel wait before it decides a shutdown timeout has occurred? How long does the hypervisor wait, and is that longer than the kernel would wait? Are you even sure that the VM shutdown command you're sending is the graceful one?
How would you fsync at step 3?
For step 2, perhaps you might have an exit script that calls `fsync`.
For step 1, perhaps you might call `fsync` after `apt-get install` is done.
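Both variants above reduce to flushing dirty pages before the VM can be powered off. A minimal sketch of the step-2 version (a bare `sync` flushes everything, which is enough here even though the comments say fsync; the apt-get line is illustrative and commented out):

```shell
#!/bin/sh
set -e
# ... CI work that writes to disk, e.g.:
# apt-get install -y some-package
echo 'build artifact' > artifact.txt

# Flush all dirty pages before the job ends and the
# hypervisor powers off the VM.
sync
echo done
```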
(to be clear: my comment is sarcasm, and "web scale" is a reference to a joke about reliability [0])
If you want to truly speed up builds by optimizing disk performance, there are no shortcuts to physically attaching NVMe storage with high throughput and high IOPS to your compute directly.
That's what we do at WarpBuild[0], and we outperform Depot runners handily. This is because we do not use network-attached disks, which come with relatively higher latency. Our runners are also paired with faster processors.
I love the Depot content team though, it does a lot of heavy lifting.
Trading Strategy looks super cool, by the way.
[1]: https://runs-on.com/benchmarks/github-actions-disk-performan...
Are there any reasonable alternatives for a really tiny FOSS project?
you can check us out at https://yeet.cx
we also have an anonymous guest sandbox you can play with