AWS doesn’t talk about that much publicly, but if you press them they will admit in private that there are some pretty nasty single points of failure in the design of AWS that can materialize if us-east-1 has an issue. Most people would say that means AWS isn’t truly multi-region in some areas.
Not entirely clear yet if those single points of failure were at play here, but risk mitigation isn’t as simple as just “don’t use us-east-1” or “deploy in multiple regions with load balancing failover.”
They've been charging a premium for, and marketing, "Availability" for decades at this point. I worked for a competitor and made a better product: it could endure any of the zones failing.
For the uninitiated: https://en.wikipedia.org/wiki/Room_641A
(I'm assuming by zone you mean the equivalent of an AWS region, with multiple connected datacenters)
I'll let people guess for the sport of it. Here's the hint: there were at least 30 of them, composed of Real Datacenters. Thanks for the doubt, though. Implied or otherwise.
Your experiment proves nothing. Anyone can pull it off.
There can be other valid use cases than your own.
IAD datacenters have forever been the place where Amazon software developers implement services first (well before AWS was a thing).
Multi-AZ support often comes second (more than you think; Amazon is a pragmatic company), and not every service is easy to make TRULY multi-AZ.
And then other services depend on those services, and may also fall into the same trap.
...and so much of the tech/architectural debt gets concentrated into a single region.
Why would they keep a large set of centralized, core traffic services in Virginia for decades despite it being a bad design?
It's the rest of the world that has not. For a long time companies just ran everything in us-east-1 (e.g. Heroku), without even having an option to switch to another region.
You eventually get services that need to be global. IAM and DNS are such examples: they have to have a global endpoint because they apply to global entities. AWS users are not regionalized; an AWS user can use the same key/role to access resources in multiple regions.
I mean look at their console. Their console application is pretty subpar.
"We can't run things on just a box! That's a single point of failure. We're moving to cloud!"
The difference is that when the cloud goes down you can shift the blame to them, not you, and fixing it is their problem.
The corporate world is full of stuff like this. A huge role of consultants like McKinsey is to provide complicated reports and presentations backing the ideas that the CEO or other board members want to pursue. That way if things don't work out they can blame McKinsey.
That's a very human sentiment, and I share it. That's why I don't swap my car wheels myself, I don't want to feel responsible if one comes loose on the highway and I cause an accident.
But at the same time it's also appalling how low the bar has gotten. We're still the ones deciding that one cloud is enough. The downtime being "their" fault really shouldn't excuse that fact. Most services aren't important enough to need redundancy. But if yours is, and it goes down because you decided that one provider is enough, then your provider isn't solely at fault here, and as a profession I wish we'd take more accountability.
But I’ve led enough cloud implementations where I discuss the cost and complexity trade-offs between multi-AZ (it’s almost free, so why not), multi-region, and theoretically multi-cloud (never came up in my experience), and then cold, warm and hot standby, RTO and RPO, etc.
And for the most part, most businesses are fine with just multi-AZ as long as their data can survive catastrophe.
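A minimal sketch of that cost-versus-recovery conversation, in Python. The strategy names follow the cold/warm/hot standby tiers mentioned above; every number (RTO, RPO, relative cost) is a made-up placeholder for illustration, not a measurement or AWS pricing, and `Strategy` and `cheapest_strategy` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    rto_minutes: float    # worst-case time to restore service (RTO)
    rpo_minutes: float    # worst-case window of lost data (RPO)
    relative_cost: float  # multiple of a single-region baseline


# Placeholder figures, assumed purely for illustration.
STRATEGIES: List[Strategy] = [
    Strategy("cold standby", rto_minutes=1440, rpo_minutes=1440, relative_cost=1.1),
    Strategy("warm standby", rto_minutes=60, rpo_minutes=15, relative_cost=1.5),
    Strategy("hot standby", rto_minutes=1, rpo_minutes=0, relative_cost=2.0),
]


def cheapest_strategy(
    max_rto_minutes: float,
    max_rpo_minutes: float,
    strategies: List[Strategy] = STRATEGIES,
) -> Optional[Strategy]:
    """Return the lowest-cost strategy meeting both targets, or None."""
    candidates = [
        s for s in strategies
        if s.rto_minutes <= max_rto_minutes and s.rpo_minutes <= max_rpo_minutes
    ]
    return min(candidates, key=lambda s: s.relative_cost, default=None)


if __name__ == "__main__":
    choice = cheapest_strategy(max_rto_minutes=120, max_rpo_minutes=30)
    print(choice.name if choice else "no strategy meets the targets")
```

The point of the exercise is that the business picks the RTO/RPO targets first, and the architecture (and its bill) falls out of that, rather than the other way around.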
The stuff I'm proudest of solved a problem and made money but it wasn't complicated for the sake of being complicated. It's like asking a mechanical engineer "what's the thing you've designed with the most parts?"
I thought that if us-east-1 goes down you might not be able to administer (or bring up new services) in other zones, but if you have services running that can take over from us-east-1, you can maintain your app/website etc.
I haven’t had to do this for several years, but that was my experience during an outage a few years ago. Obviously it depends on the services you’re using.
You can’t start cloning things to other regions after us-east-1 is down; by then you’ve left it too late.
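To make that concrete, here is a rough Python sketch of failover that only ever considers regions where capacity was provisioned before the outage. The function name, the health map, and the region lists are all hypothetical; a real setup would get health from its own checks and would not call any AWS API in this decision path.

```python
from typing import Dict, Iterable, Optional


def pick_serving_region(
    preference: Iterable[str],
    healthy: Dict[str, bool],
    provisioned: Iterable[str],
) -> Optional[str]:
    """Return the first preferred region that is already provisioned and healthy.

    Regions that were never provisioned are skipped even if healthy, since
    bringing up new resources may not be possible during the outage.
    """
    ready = set(provisioned)
    for region in preference:
        if region in ready and healthy.get(region, False):
            return region
    return None


if __name__ == "__main__":
    preference = ["us-east-1", "us-west-2", "eu-west-1"]
    health = {"us-east-1": False, "us-west-2": True, "eu-west-1": True}

    # Standby was set up in us-west-2 ahead of time: failover works.
    print(pick_serving_region(preference, health, {"us-east-1", "us-west-2"}))

    # Nothing was set up elsewhere: there is nowhere to go.
    print(pick_serving_region(preference, health, {"us-east-1"}))
```

The first call prints `us-west-2` and the second prints `None`, which is the "left it too late" case.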
I think it's hypocritical for them to push customers to double or triple their spend in AWS when they themselves have single points of failure in a single region.
It was only when stuff started breaking that all this crap about “well actually stuff still relies on us-east-1” started coming out.
Well, it did for me today. I don't use us-east-1 explicitly, just other regions, and I had no outage today. (I get the point about the skeletons in the closet of us-east-1... maybe the power plug goes via Bezos's wooden desk?)
What it turned into was Daedalus from Deus Ex lol.
I hope they release a good root cause analysis report.