> the only realistic impact of picking a sub-optimal algorithm is slightly higher CPU utilization rates and background tasks taking a bit longer to finish.
At FAANG scale this couldn't be further from the truth. Let me give an example from my own work. With a tiny bit of rewording, it could cover at least two other examples I have worked on as well, at multiple companies, so I think it generalizes well enough.
A global control system needed to generate its control signals algorithmically because it had to optimize an objective worth billions per year. The dataset had grown to take over 8 hours to process, which was just barely tolerable in fair weather, it became a batch job. This was doubling every year, mind you.
Sometimes tactical changes were required on short notice, e.g. a datacenter outage required failing over to another datacenter and re-calculating the optimal way to meet those objectives, not just for resource costs but user experience. Here it was bad enough taking 8 hours, and nobody wanted to see it double in the following year.
In a few days I rewrote this tool to give the same optimal results in under 5 minutes. Now every time a tactical change is needed, it's no big deal at all. There's no doubt this saved millions already, and as the dataset continues to grow, taking 15 minutes would still be a lot better than taking 24 hours.
Another case was even more extreme but harder to compare. Re-generating all control plane data from management plane intent would have taken at least several days with the old algorithm, but this was so impractical that nobody ever did it and we never had a number for it. I made it take less than a second total, completely changing how the entire architecture, implementation, operations, observability, scale, availability, latency, etc. hinged around the new algorithm. It was the single biggest improvement to a large platform that impacted the entire company, and it's a big company.
I have several such cases. Sometimes they're offline tools that occasionally end up in the critical path of an operation during an incident, sometimes they're in the critical path of serving user requests or in a control system affecting the availability and user experience of other user-facing services. None of these were "slightly higher CPU utilization rates", they were a quantative improvement to performance so great that it made a qualitative improvement to what was possible.