The only people who should be shocked in this thread are the people who have been hoodwinked into thinking operations is so hard you need thousands of staff. I know AWS/GCP/Azure like to charge us as if we were hiring an army of sysadmins, but the truth is that day-to-day DC ops does not require that many people. Hardware failures are rarer than you think, and you can work around them without panicking anyway.
In this case they can't even get blamed for their vendor choice because both AWS and Azure are now so big that they're in "nobody ever got fired for buying IBM" territory.
I don’t think AWS is blowing up a vendor's phone when something goes wrong in one of their facilities.
[1] https://www.datacenterknowledge.com/archives/2017/04/07/how-...
Bear in mind that outside of the US and maybe one or two other locations, almost all of the magic cloud operates out of third-party datacentres, not the providers' own.
They will have a small office on-site where 3–5 people sit, and those people are exclusively dedicated to the cloud equipment itself. The datacentre ops side is, by definition, handled by the third-party datacentre operator.
The guys on-site are clearly only there for "intelligent hands" purposes, as everything else will be done remotely from Silicon Valley or wherever.
Across all the datacenters the number of operations personnel likely exceeds 100. Think of the unit of scale as a datacenter, with an availability zone potentially containing 10+ of those.
[1] https://www.datacenterfrontier.com/cloud/article/11427911/aw...
Larger DCs can and do have more staff on-site 24/7, and typically the number of staff on-site at any given time is driven by SLAs.
I expect the DC in TFA to return to lower staff levels once they've worked on reducing their total "time to restart chiller" or reduced the amount of manual work involved in doing so.
That said, I have no idea. When I worked (many many many many years ago) in a small DC perhaps the size of a two-bed apartment, we had 4 guys scurrying about doing stuff (hands-on-keyboard, routing cables, replacing hardware, etc.). This was way before Docker & Kubernetes et al - physical iron and all that. I would assume that in modern DC ops you could run a football-field-sized DC with fewer than 10 people due to automation. But that said, if part of the actual infrastructure like power or cooling fails, you need to have the right skill-set in place. If the cooling had failed and couldn't just be turned off and on again, we would have been out of luck in my old DC days and would have had to call someone in and just hope the servers didn't fry in the meantime. Sounds like a similar deal here.
They go on-site to geographically adjacent DCs, and beyond that only travel on-site for special projects.
I've slept so much better since I began hosting, generating power, and cooling on-prem.
For those who do need to use the cloud, just make sure you are running your services across different failure zones.
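A minimal sketch of what that looks like, assuming boto3 against EC2 (the region is just an example and the AMI id is a placeholder):

    # Pin one instance to each of two different AZs so a single-zone failure
    # (or a chiller incident like TFA's) doesn't take the whole service down.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")  # example region
    azs = [z["ZoneName"]
           for z in ec2.describe_availability_zones()["AvailabilityZones"]
           if z["State"] == "available"]

    for az in azs[:2]:
        ec2.run_instances(
            ImageId="ami-00000000000000000",  # placeholder AMI id
            InstanceType="t3.micro",
            MinCount=1, MaxCount=1,
            Placement={"AvailabilityZone": az},
        )

In practice you'd put those behind a load balancer with health checks so traffic shifts automatically when a zone drops out.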
By which time you might as well just roll out your own kit in colocation or your own datacentres.
The cloud providers are nickel-and-dimers; they charge you for every tiny little thing.
Cloud might look cheap at cents-per-hour, but then you find you need X "services" to deliver your Service, so the real bill is multiplicative: X cloud services times x cents-per-hour each, every hour.
And then running your services across failure zones will of course cost you more than just the basic doubling, because most cloud providers charge by the GB for cross-zone traffic. So if you're doing cross-zone replication, that's gonna cost you a pretty penny.
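Quick back-of-envelope, where every figure is an assumption rather than a quoted price (AWS-style inter-AZ transfer is commonly around $0.01/GB in each direction, i.e. ~$0.02/GB total, but check your provider's sheet):

    # Hypothetical monthly bill; every number here is an assumption.
    services = 8                 # VMs, LB, NAT gateway, managed DB, cache, logs...
    rate = 0.05                  # assumed blended $/hour per service
    base = services * rate * 730            # ~730 hours/month -> ~$292/month

    replicated_gb = 10_000       # assume 10 TB/month of cross-zone replication
    transfer = replicated_gb * 0.02         # ~$0.02/GB both directions -> $200/month

    print(f"base ~${base:,.0f}/mo + cross-zone ~${transfer:,.0f}/mo")

The transfer line scales linearly with your replication volume, which is exactly the per-GB fee the next point is about.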
Meanwhile, in your own colo/DC, you have predictable costs. And you can get redundant connections between sites for a flat rate, not some stupid per-GB fee.
People talk about this often but this failure mode seems to never happen? When was the last time us-east-1 went down because of a natural calamity compared to some technical issue?
I would be very, very surprised if the companies mentioned, in particular banks, weren't running on multiple AZs, but I wouldn't be surprised if the scenario of losing an entire AZ was never actually tested.
It is.
The cloud fanbois will tell you until they're blue in the face that it's not.
I fully accept that the cloud is great for bursty workloads where you're doing nothing and then suddenly half the planet needs your service for a couple of days. That is clear.
But if you've got a reasonably stable baseload running 24x7x365 and a few modest bursts here and there, then honestly people need to do the math, because if you look beyond the short-term figures, the cloud tends to work out much more expensive than colo over, for example, a three-year period.
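To make that concrete, a hypothetical three-year comparison where every figure is made up for illustration (plug in your own quotes):

    # Steady 24x7 baseload over three years; all numbers are assumptions.
    hours = 3 * 365 * 24                # 26,280 hours
    cloud = hours * 0.40                # assumed $0.40/hr on-demand VM -> ~$10,512
    server = 3_000                      # assumed one-off price of comparable hardware
    colo = 36 * 100                     # assumed $100/month rack space, power, transit
    print(f"cloud ~${cloud:,.0f} vs colo ~${server + colo:,.0f} over 3 years")
    # -> cloud ~$10,512 vs colo ~$6,600

Reserved instances narrow the gap, but then you're locked in for the term anyway, which is exactly the commitment the cloud is supposed to save you from.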
Most people don't need the scale the cloud gives. They think they do, but really most people will never grow to FAANG scale, as much as they may dream it!