The only people who should be shocked in this thread are the people who have been hoodwinked into thinking operations is so hard you need thousands of staff. I know AWS/GCP/Azure like to charge us as if we were hiring an army of sysadmins, but the truth is that day-to-day DC ops does not require that many people. Hardware failures are rarer than you think, and you can work around them without panicking anyway.
In this case they can't even get blamed for their vendor choice because both AWS and Azure are now so big that they're in "nobody ever got fired for buying IBM" territory.
I don’t think AWS is blowing up a vendor's phone when something goes wrong in one of their facilities.
[1] https://www.datacenterknowledge.com/archives/2017/04/07/how-...
Bear in mind that outside of the US and maybe one or two other locations, almost all of the magic cloud operates out of third-party datacentres, not the providers' own.
They will have a small office on-site where 3–5 people sit, and those people are exclusively dedicated to the cloud equipment itself. The datacentre ops side is, by definition, handled by the third-party datacentre operator.
The guys on-site are clearly only there for "intelligent hands" purposes, as everything else will be done remotely from Silicon Valley or wherever.
Across all the datacenters the number of operations personnel likely exceeds 100. Think of the unit of scale as a datacenter, with an availability zone potentially containing 10+ of those.
[1] https://www.datacenterfrontier.com/cloud/article/11427911/aw...
Larger DCs can and do have more staff on-site 24/7, and typically the number of staff on-site at any given time is driven by SLAs.
I expect the DC in TFA to return to lower staff levels once they've worked on reducing their total "time to restart chiller" or reduced the amount of manual work involved in doing so.
That said, I have no idea. When I worked (many many many many years ago) in a small DC perhaps the size of a two-bed apartment, we had 4 guys scurrying about doing stuff (hands-on-keyboard, routing cables, replacing hardware, etc.). This was way before Docker & Kubernetes et al - physical iron and all that. I would assume that in modern DC ops you could run a football-field-sized DC with fewer than 10 people due to automation. But that said, if part of the actual infrastructure like power or cooling fails, you need to have the right skill-set in place. If the cooling had failed and couldn't just be turned off and on again, we would have been out of luck in my old DC days and would have had to call someone in and just hope the servers didn't fry in the meantime. Sounds like a similar deal here.
They go on-site to geographically adjacent DCs, and beyond that only travel on-site for special projects.
I've slept so much better since I began hosting, generating power, and cooling on-prem.
For those who do need to use the cloud, just make sure you are running your services across different failure zones.
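A minimal sketch of what that looks like, assuming boto3 against EC2 (the region is just an example and the AMI id is a placeholder):

    # Pin one instance to each of two different AZs so a single-zone failure
    # (or a chiller incident like TFA's) doesn't take the whole service down.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")  # example region
    azs = [z["ZoneName"]
           for z in ec2.describe_availability_zones()["AvailabilityZones"]
           if z["State"] == "available"]

    for az in azs[:2]:
        ec2.run_instances(
            ImageId="ami-00000000000000000",  # placeholder AMI id
            InstanceType="t3.micro",
            MinCount=1, MaxCount=1,
            Placement={"AvailabilityZone": az},
        )

In practice you'd put those behind a load balancer with health checks so traffic shifts automatically when a zone drops out.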
By which time you might as well just roll out your own kit in colocation or your own datacentres.
The cloud providers are nickel-and-dimers; they charge you for every tiny little thing.
Cloud might look cheap at cents-per-hour, but then you find you need X "services" to deliver your Service, so the real bill is multiplicative: X cloud services times x cents-per-hour each, every hour.
And then running your services across failure zones will of course cost you more than just the basic doubling, because most cloud providers charge by the GB for cross-zone traffic. So if you're doing cross-zone replication, that's gonna cost you a pretty penny.
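Quick back-of-envelope, where every figure is an assumption rather than a quoted price (AWS-style inter-AZ transfer is commonly around $0.01/GB in each direction, i.e. ~$0.02/GB total, but check your provider's sheet):

    # Hypothetical monthly bill; every number here is an assumption.
    services = 8                 # VMs, LB, NAT gateway, managed DB, cache, logs...
    rate = 0.05                  # assumed blended $/hour per service
    base = services * rate * 730            # ~730 hours/month -> ~$292/month

    replicated_gb = 10_000       # assume 10 TB/month of cross-zone replication
    transfer = replicated_gb * 0.02         # ~$0.02/GB both directions -> $200/month

    print(f"base ~${base:,.0f}/mo + cross-zone ~${transfer:,.0f}/mo")

The transfer line scales linearly with your replication volume, which is exactly the per-GB fee the next point is about.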
Meanwhile, in your own colo/DC, you have predictable costs. And you can get redundant connections between sites for a flat rate, not some stupid per-GB fee.
People talk about this often but this failure mode seems to never happen? When was the last time us-east-1 went down because of a natural calamity compared to some technical issue?
I would be very, very surprised if the companies mentioned, in particular banks, weren't running on multiple AZs, but I wouldn't be surprised if the scenario of losing an entire AZ was never actually tested.
It is.
The cloud fanbois will tell you until they're blue in the face that it's not.
I fully accept that the cloud is great for bursty workloads where you're doing nothing and then suddenly half the planet needs your service for a couple of days. That is clear.
But if you've got a reasonably stable baseload running 24x7x365 and a few modest bursts here and there, then honestly people need to do the math, because if you look beyond the short-term figures, the cloud tends to work out much more expensive than colo over, for example, a three-year period.
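To make that concrete, a hypothetical three-year comparison where every figure is made up for illustration (plug in your own quotes):

    # Steady 24x7 baseload over three years; all numbers are assumptions.
    hours = 3 * 365 * 24                # 26,280 hours
    cloud = hours * 0.40                # assumed $0.40/hr on-demand VM -> ~$10,512
    server = 3_000                      # assumed one-off price of comparable hardware
    colo = 36 * 100                     # assumed $100/month rack space, power, transit
    print(f"cloud ~${cloud:,.0f} vs colo ~${server + colo:,.0f} over 3 years")
    # -> cloud ~$10,512 vs colo ~$6,600

Reserved instances narrow the gap, but then you're locked in for the term anyway, which is exactly the commitment the cloud is supposed to save you from.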
Most people don't need the scale the cloud gives. They think they do, but really most people will never grow to FAANG scale, as much as they may dream it!