AWS doesn’t talk about this much publicly, but if you press them they’ll admit in private that there are some pretty nasty single points of failure in the design of AWS that materialize when us-east-1 has an issue. Most people would say that means AWS isn’t truly multi-region in some areas.
It’s not entirely clear yet whether those single points of failure were at play here, but risk mitigation isn’t as simple as just “don’t use us-east-1” or “deploy in multiple regions with load-balancing failover.”
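To make that concrete: one well-known piece of hidden coupling is that the “global” STS endpoint, sts.amazonaws.com, is (or at least historically was) served out of us-east-1, so even workloads running entirely in other regions can lose the ability to vend credentials during a us-east-1 incident. A minimal boto3 sketch of the standard mitigation, pinning clients to a regional endpoint (the region choice here is arbitrary):

    import boto3

    # The "global" STS endpoint (sts.amazonaws.com) is served from
    # us-east-1, so a us-east-1 incident can break credential vending
    # even for workloads running elsewhere. Pinning to a regional
    # endpoint removes that dependency.
    sts = boto3.client(
        "sts",
        region_name="us-west-2",
        endpoint_url="https://sts.us-west-2.amazonaws.com",
    )

    # Sanity check that the regional endpoint is answering.
    print(sts.get_caller_identity()["Arn"])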
They've been marketing, and charging a premium for, "availability" for decades at this point. I worked for a competitor and we built a better product: it could survive any one of its zones failing.
For the uninitiated: https://en.wikipedia.org/wiki/Room_641A
(I'm assuming by zone you mean the equivalent of an AWS region, with multiple connected datacenters)
I'll let people guess for the sport of it; here's a hint: there were at least 30 of them, each made up of Real Datacenters. Thanks for the doubt, though, implied or otherwise.
The IAD datacenters have forever been the place where Amazon software developers launch services first (since well before AWS was a thing).
Multi-AZ support often comes second (more often than you'd think; Amazon is a pragmatic company), and not every service is easy to make TRULY multi-AZ.
And then other services depend on those services, and may also fall into the same trap.
...and so much of the tech/architectural debt gets concentrated in a single region.
Why would they keep a large set of centralized, core traffic services in Virginia for decades despite it being a bad design?
It's the rest of the world that hasn't. For a long time companies just ran everything in us-east-1 (e.g., Heroku), without even giving you an option to switch to another region.
I mean, look at their console. It's pretty subpar.
"We can't run things on just a box! That's a single point of failure. We're moving to cloud!"
The difference is that when the cloud goes down, the blame shifts to them rather than you, and fixing it is their problem.
The corporate world is full of stuff like this. A huge role of consultants like McKinsey is to provide complicated reports and presentations backing the ideas that the CEO or other board members want to pursue. That way if things don't work out they can blame McKinsey.
That's a very human sentiment, and I share it. That's why I don't swap my car's wheels myself: I don't want to be responsible if one comes loose on the highway and I cause an accident.
But at the same time, it's appalling how low the bar has gotten. We're still the ones deciding that one cloud is enough. The downtime being "their" fault really shouldn't excuse that. Most services aren't important enough to need redundancy. But if yours is, and it goes down because you decided one provider was enough, then your provider isn't solely at fault, and as a profession I wish we'd take more accountability.
I thought that if us-east-1 goes down you might not be able to administer (or bring up new services in) other regions, but if you already have services running that can take over from us-east-1, you can keep your app/website etc. running.
I haven’t had to do this in several years, but that was my experience during an outage a few years ago; obviously it depends on the services you’re using.
You can’t start cloning things to other regions after us-east-1 is already down; by then you’ve left it too late.
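That matches how this is supposed to work: Route 53 health checks and failover routing live on the data plane, so if the standby stack already exists, the DNS flip happens on its own even while the control plane (your ability to create or modify resources) is degraded. A rough boto3 sketch of wiring that up ahead of time; the hosted zone ID, IPs, and health check ID below are hypothetical placeholders:

    import boto3

    route53 = boto53 = boto3.client("route53")

    HOSTED_ZONE_ID = "Z0000000000EXAMPLE"  # hypothetical hosted zone

    # Created well before any outage: failover routing is evaluated by
    # the Route 53 data plane, so the switch to the standby happens
    # without any control-plane calls at outage time.
    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={
            "Changes": [
                {
                    "Action": "UPSERT",
                    "ResourceRecordSet": {
                        "Name": "app.example.com",
                        "Type": "A",
                        "SetIdentifier": "primary-us-east-1",
                        "Failover": "PRIMARY",
                        "TTL": 60,
                        "ResourceRecords": [{"Value": "192.0.2.10"}],
                        # Health check on the primary; when it fails,
                        # resolvers get the SECONDARY record instead.
                        "HealthCheckId": "11111111-2222-3333-4444-555555555555",
                    },
                },
                {
                    "Action": "UPSERT",
                    "ResourceRecordSet": {
                        "Name": "app.example.com",
                        "Type": "A",
                        "SetIdentifier": "standby-us-west-2",
                        "Failover": "SECONDARY",
                        "TTL": 60,
                        "ResourceRecords": [{"Value": "198.51.100.10"}],
                    },
                },
            ]
        },
    )

The standby in us-west-2 has to be running (or at least warm) before the outage; the record sets only control where traffic goes, not whether anything is there to receive it.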
I think it's hypocritical for them to push customers to double or triple their AWS spend for redundancy when they themselves have single points of failure concentrated in a single region.
It was only when stuff started breaking that all this crap about “well, actually, stuff still relies on us-east-1” started coming out.
Well, it did for me today... I don't use us-east-1 explicitly, just other regions, and I had no outage today. (I get the point about the skeletons in us-east-1's closet... maybe the power plug runs through Bezos's wood desk?)
What it turned into was Daedalus from Deus Ex lol.
I hope they release a good root cause analysis report.