Secondly, if you're actually trying to start up 100+ test workers per build, there's going to be some distribution in how long each worker takes to start, and since the build waits on the slowest one, that adds a bit more time before all workers complete. This distribution probably isn't _that_ wide timewise, but if you really start to push your test suite runtime down it may pop up. If you're running things in docker, sometimes a node doesn't have the image in its docker cache...
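To put rough numbers on the startup-spread point, here's a quick simulation sketch. Everything in it is an assumption for illustration: the lognormal shape, the 20 second median, the `sigma` value, and the function names are all made up, not measurements from any real CI system. It just shows how the expected wait for the slowest worker grows as you add workers.

```python
import math
import random


def slowest_startup(num_workers, rng, median_s=20.0, sigma=0.3):
    """Simulate one build and return the slowest worker's startup time in seconds."""
    # Assumption: each worker's startup time is lognormal with the given median.
    mu = math.log(median_s)
    return max(rng.lognormvariate(mu, sigma) for _ in range(num_workers))


def mean_slowest_startup(num_workers, trials=2000, seed=0):
    """Average the slowest-worker startup time over many simulated builds."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    total = sum(slowest_startup(num_workers, rng) for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(f"{n:>3} workers: slowest starts after ~{mean_slowest_startup(n):.1f}s")
```

The build can't finish before its slowest worker has started, so the tail of the distribution matters more than the median once you have a lot of workers. A cold docker cache on one node would make that tail fatter than this toy model assumes.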
I'm unsure whether CI services like Buildkite have really made this that much faster, but it seems like they are using a single box with 64 cores.