AWS doesn’t talk about this much publicly, but if you press them they’ll admit in private that there are some pretty nasty single points of failure in the design of AWS that materialize when us-east-1 has an issue. Most people would say that means AWS isn’t truly multi-region in some areas.
It’s not entirely clear yet whether those single points of failure were at play here, but risk mitigation isn’t as simple as just “don’t use us-east-1” or “deploy in multiple regions with load-balancing failover.”
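To make that concrete: one well-known piece of hidden coupling is that the “global” STS endpoint, sts.amazonaws.com, is (or at least historically was) served out of us-east-1, so even workloads running entirely in other regions can lose the ability to vend credentials during a us-east-1 incident. A minimal boto3 sketch of the standard mitigation, pinning clients to a regional endpoint (the region choice here is arbitrary):

    import boto3

    # The "global" STS endpoint (sts.amazonaws.com) is served from
    # us-east-1, so a us-east-1 incident can break credential vending
    # even for workloads running elsewhere. Pinning to a regional
    # endpoint removes that dependency.
    sts = boto3.client(
        "sts",
        region_name="us-west-2",
        endpoint_url="https://sts.us-west-2.amazonaws.com",
    )

    # Sanity check that the regional endpoint is answering.
    print(sts.get_caller_identity()["Arn"])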
They've been marketing, and charging a premium for, "availability" for decades at this point. I worked for a competitor and we built a better product: it could survive any one of its zones failing.
For the uninitiated: https://en.wikipedia.org/wiki/Room_641A
(I'm assuming by zone you mean the equivalent of an AWS region, with multiple connected datacenters)
I'll let people guess for the sport of it; here's a hint: there were at least 30 of them, each made up of Real Datacenters. Thanks for the doubt, though, implied or otherwise.
The IAD datacenters have forever been the place where Amazon software developers launch services first (since well before AWS was a thing).
Multi-AZ support often comes second (more often than you'd think; Amazon is a pragmatic company), and not every service is easy to make TRULY multi-AZ.
And then other services depend on those services, and may also fall into the same trap.
...and so much of the tech/architectural debt gets concentrated in a single region.
Why would they keep a large set of centralized, core traffic services in Virginia for decades despite it being a bad design?
It's the rest of the world that hasn't. For a long time companies just ran everything in us-east-1 (e.g., Heroku), without even giving you an option to switch to another region.
I mean, look at their console. It's pretty subpar.
"We can't run things on just a box! That's a single point of failure. We're moving to cloud!"
The difference is that when the cloud goes down, the blame shifts to them rather than you, and fixing it is their problem.
The corporate world is full of stuff like this. A huge role of consultants like McKinsey is to provide complicated reports and presentations backing the ideas that the CEO or other board members want to pursue. That way if things don't work out they can blame McKinsey.
That's a very human sentiment, and I share it. That's why I don't swap my car's wheels myself: I don't want to be responsible if one comes loose on the highway and I cause an accident.
But at the same time, it's appalling how low the bar has gotten. We're still the ones deciding that one cloud is enough. The downtime being "their" fault really shouldn't excuse that. Most services aren't important enough to need redundancy. But if yours is, and it goes down because you decided one provider was enough, then your provider isn't solely at fault, and as a profession I wish we'd take more accountability.
I thought that if us-east-1 goes down you might not be able to administer (or bring up new services in) other regions, but if you already have services running that can take over from us-east-1, you can keep your app/website etc. running.
I haven’t had to do this in several years, but that was my experience during an outage a few years ago; obviously it depends on the services you’re using.
You can’t start cloning things to other regions after us-east-1 is already down; by then you’ve left it too late.
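That matches how this is supposed to work: Route 53 health checks and failover routing live on the data plane, so if the standby stack already exists, the DNS flip happens on its own even while the control plane (your ability to create or modify resources) is degraded. A rough boto3 sketch of wiring that up ahead of time; the hosted zone ID, IPs, and health check ID below are hypothetical placeholders:

    import boto3

    route53 = boto53 = boto3.client("route53")

    HOSTED_ZONE_ID = "Z0000000000EXAMPLE"  # hypothetical hosted zone

    # Created well before any outage: failover routing is evaluated by
    # the Route 53 data plane, so the switch to the standby happens
    # without any control-plane calls at outage time.
    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={
            "Changes": [
                {
                    "Action": "UPSERT",
                    "ResourceRecordSet": {
                        "Name": "app.example.com",
                        "Type": "A",
                        "SetIdentifier": "primary-us-east-1",
                        "Failover": "PRIMARY",
                        "TTL": 60,
                        "ResourceRecords": [{"Value": "192.0.2.10"}],
                        # Health check on the primary; when it fails,
                        # resolvers get the SECONDARY record instead.
                        "HealthCheckId": "11111111-2222-3333-4444-555555555555",
                    },
                },
                {
                    "Action": "UPSERT",
                    "ResourceRecordSet": {
                        "Name": "app.example.com",
                        "Type": "A",
                        "SetIdentifier": "standby-us-west-2",
                        "Failover": "SECONDARY",
                        "TTL": 60,
                        "ResourceRecords": [{"Value": "198.51.100.10"}],
                    },
                },
            ]
        },
    )

The standby in us-west-2 has to be running (or at least warm) before the outage; the record sets only control where traffic goes, not whether anything is there to receive it.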
I think it's hypocritical for them to push customers to double or triple their AWS spend for redundancy when they themselves have single points of failure concentrated in a single region.
It was only when stuff started breaking that all this crap about “well, actually, stuff still relies on us-east-1” started coming out.
Well, it did for me today... I don't use us-east-1 explicitly, just other regions, and I had no outage today. (I get the point about the skeletons in us-east-1's closet... maybe the power plug runs through Bezos's wood desk?)
What it turned into was Daedalus from Deus Ex lol.
I hope they release a good root cause analysis report.