altbdoor 7mo ago

Had a meeting where developers were discussing the infrastructure for an application. A crucial part of the whole flow was completely dependent on an AWS service. I asked if it was a single point of failure. The whole room laughed. I rest my case.

aeve890 7mo ago

Similar experience here. People laughed and some said something like "well, if something like AWS falls then we have bigger problems". They laugh because honestly it is too far-fetched to think of the whole AWS infra going down. Too big to fail, as they say in the US. Nothing short of a nuclear war would fuck up the entire AWS network, so they're kinda right.

Until this happens. A single region in a cascading failure, and your SaaS is single-region.

stephenlf 7mo ago

They’re not wrong though. If AWS goes down, EVERYTHING goes down to some degree. Your app, your competitor’s apps, your clients’ chat apps. You’re kinda off the hook.

oceanplexian 7mo ago

Why would your competitors go down? AWS has at best 30-35% market share. And that's ignoring the huge mass of companies who still run their infrastructure on bare metal.

CharlieDigital 7mo ago

A whole bunch of meeting bots use Recall.

Recall is on AWS.

Everyone using Recall for meeting recordings is down.

In some domains, a single SaaS dominates, and if that SaaS sits on AWS, it doesn't matter that AWS has only 35% market share: the SaaS that dominates 80% of the domain is on AWS, so the effect is wider than AWS's market share alone.

We're on GCP, but we have various SaaS vendors on AWS so any of the services that rely on AWS are gone.

Many chat/meeting services also run on AWS Chime so even if you're not on AWS, if a vendor uses Chime, that service is down.
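
To make the transitive-dependency point concrete, here is a minimal sketch that walks a dependency map and lists everything that goes down with one provider. Only the Recall-on-AWS and Chime-on-AWS edges come from the comment above; the other service names (`our_app`, `meeting_bot`, `chat_vendor`) are hypothetical and exist purely for illustration.

```python
from collections import deque

# Hypothetical dependency map: each key depends on the listed services.
# Only "recall" -> "aws" and "chime" -> "aws" are taken from the thread;
# every other name here is made up for illustration.
DEPENDS_ON = {
    "our_app": ["gcp", "meeting_bot"],
    "meeting_bot": ["recall"],
    "recall": ["aws"],
    "chat_vendor": ["chime"],
    "chime": ["aws"],
    "gcp": [],
    "aws": [],
}


def affected_by(failed, depends_on):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the map so we can walk from a provider to its dependents.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    # Breadth-first search outward from the failed provider.
    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service not in seen:
                seen.add(service)
                queue.append(service)
    return seen


print(sorted(affected_by("aws", DEPENDS_ON)))
```

In this toy map, `our_app` hosts nothing on AWS and still shows up in the result, because one of its vendors depends on a vendor that does.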

tetha 7mo ago

Part of the company I work at does infrastructure consulting. We are in fact seeing companies move to bare metal, with the rise of turnkey container systems from Nutanix, Pure Storage, Red Hat, ... At this point, a few remotely managed boxes in a rack can offer a really good experience for containers for very little effort.

And this comes at a time with regulations like DORA, and BaFin tightening things: managing these boxes becomes less effort than maintaining compliance across vendors.

daheza 7mo ago

Because your competitor probably depends on a service which uses AWS. They may host all their stuff in Azure, but use CloudFront as a cache, which runs on AWS and goes down.

codeduck 7mo ago

Because your competitors are probably using services that depend on AWS.

palmotea 7mo ago

>> People laughed and some said something like "well, if something like AWS falls then we have bigger problems".

> They’re not wrong though. If AWS goes down, EVERYTHING goes down to some degree. Your app, your competitor’s apps, your clients’ chat apps. You’re kinda off the hook.

They made their own bigger problems by all crowding into the same single region.

pluto_modadic 7mo ago

It's a weird effect:

Imagine a beach with ice cream vendors. You'd think it would be optimal for two vendors to split it, half north, half south. However, because each wants to steal some of the other vendor's customers, you end up with two ice cream stands in the center.

So too with outages. Safety / loss of blame in numbers.

immibis 7mo ago

Our app was up. I'm sure we made a lot of money.

joshstrange 7mo ago

The question really becomes, did you make money that you wouldn't have made when services came back up? As in, will people just shift their purchase time to tomorrow when you are back online? Sure, some % is completely lost but you have to weigh that lost amount against the ongoing costs to be multi-cloud (or multi-provider) and the development time against those costs. For most people I think it's cheaper to just be down for a few hours. Yes, this outage is longer than any I can remember but most people will shrug it off and move on once it comes back up fully.
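
The trade-off above can be put into rough arithmetic. This is only a back-of-the-envelope sketch: the function names and every number below are made up for illustration, and a real estimate would also have to price in development time and reputational effects.

```python
def permanently_lost_revenue(revenue_per_hour, outage_hours_per_year, fraction_lost):
    """Revenue that is gone for good, as opposed to purchases merely
    shifted to after the outage. `fraction_lost` is between 0 and 1."""
    return revenue_per_hour * outage_hours_per_year * fraction_lost


def redundancy_pays_off(revenue_per_hour, outage_hours_per_year,
                        fraction_lost, redundancy_cost_per_year):
    """True if the expected permanent loss exceeds the yearly cost of
    running multi-region or multi-provider."""
    loss = permanently_lost_revenue(
        revenue_per_hour, outage_hours_per_year, fraction_lost
    )
    return loss > redundancy_cost_per_year


# Made-up numbers: $1,000/hour in revenue, 10 hours of outage a year,
# a quarter of affected purchases never come back, and redundancy
# costs $50,000 a year to build and run.
print(permanently_lost_revenue(1_000, 10, 0.25))
print(redundancy_pays_off(1_000, 10, 0.25, 50_000))
```

With these particular inputs the expected permanent loss is well below the redundancy bill, which is the "cheaper to just be down for a few hours" case; a business with much higher hourly revenue or a larger share of unrecoverable sales would land on the other side.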

At the end of the day most of us aren't working on super critical things. No one is dying because they can't purchase X item online or use Y SaaS. And, more importantly, customers are _not_ willing to pay the extra for you to host your backend in multiple regions/providers.

In my contracts (for my personal company) I call out the single point of failure very clearly and I've never had anyone balk. If they did, I'd offer them resiliency (for a price), and I have no doubt that they would opt to "roll the dice" instead of paying.

Lastly, it's near-impossible to verify what all your vendors are using, so even if you manage to get everything resilient it only takes one chink in the armor to bring it all down (see: us-east-1 and the various AWS services that rely on it even if you don't host anything in us-east-1 directly).

I'm not trying to downplay this, pretend it doesn't matter, or anything like that. Just trying to point out that most people don't care because no one seems to care (or want to pay for it). I wish that was different (I wish a lot of things were different) but wishing doesn't pay my bills and so if customers don't want to pay for resiliency then this is what they get and I'm at peace with that.

baobabKoodaa 7mo ago

Plenty of stuff still works.

swat535 7mo ago

Sure, but to their point, you're off the hook if half of the internet is down. It's sort of like "No one gets fired for picking IBM".

lxgr 7mo ago

Nothing short of a nuclear war, a bad deploy, or some operational oopsie, and everybody knows how rare all these things are!

Elidrake24 7mo ago

If you were dependent upon a single distribution (region) of that Service, yes it would be a massive single point of failure in this case. If you weren't dependent upon a particular region, you'd be fine.

zimbu668 7mo ago

Of course lots of AWS services have hidden dependencies on us-east-1. During a previous outage we needed to update a Route53(DNS) record in us-west-2, but couldn't because of the outage in us-east-1.
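
A rough way to see why a hidden shared dependency hurts: redundant regions only help if they fail independently. The sketch below uses the standard series/parallel availability formulas; the 99% figures are hypothetical placeholders, not measured AWS numbers, and the independence assumption is exactly the one a hidden us-east-1 dependency breaks.

```python
import math


def series(*availabilities):
    """All components must be up: availabilities multiply."""
    return math.prod(availabilities)


def parallel(*availabilities):
    """At least one component must be up, assuming independent failures."""
    return 1 - math.prod(1 - a for a in availabilities)


region = 0.99             # hypothetical availability of one region
shared_dependency = 0.99  # hypothetical shared control plane both regions need

two_regions = parallel(region, region)
with_shared = series(two_regions, shared_dependency)

print(f"two independent regions:      {two_regions:.6f}")
print(f"plus one shared dependency:   {with_shared:.6f}")
```

Under these assumptions the two-region setup with a shared dependency ends up slightly less available than a single region on its own, because everything still sits in series behind the shared piece.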

ineedasername 7mo ago

So, AWS's redundant availability goes something like "Don't worry, if nothing is working in us-east-1, it will trigger failover to another region" ... "Okay, where's that trigger located?" ... "In the us-east-1 region also" ... "Doesn't that seem like a problem to you?" ... "You'd think it might be! But our logs say it's never been used."

ta1243 7mo ago

Relying on AWS is a single point of failure. Not as much as relying on a single AWS region, but it's still a single point.

It's fairly difficult to avoid single points of failure completely, and if you do it's likely your suppliers and customers haven't managed to.

It's about what your risk level is.

AWS us-east-1 fails constantly, it has terrible uptime, and you should expect it to go. A cyberattack which destroyed AWS's entire infrastructure would be less likely. BGP hijacks across multiple AWS nodes are quite plausible, though that can be mitigated to an extent with direct connects.

Sadly it seems people in charge of critical infrastructure don't even bother thinking about these things, because next quarter's numbers are more important.

I can avoid London as a single point of failure, but the loss of Docklands would cause so much damage to the UK's infrastructure I can't confidently predict that my servers in Manchester connected to peering points such as IXman will be able to reach my customer in Norwich. I'm not even sure how much international connectivity I could rely on. In theory Starlink will continue to work, but in practice I'm not confident.

When we had power issues in Washington DC a couple of months ago, three of our four independent ISPs failed, as they all had undeclared requirements on active equipment in the area. That wasn't even a major outage, just a local substation failure. The one circuit which survived was clearly just fibre from our (UPS/generator backed) equipment room to a data centre towards Baltimore (not Ashburn).

wubrr 7mo ago

Some 'regional' AWS services still rely on other services (some internal) that are only in us-east-1.

antinomicus 7mo ago

Even Amazon’s own services (e.g. Ring) were affected by this outage.

dvsgaevsvsgavsv 7mo ago

Amazing. So you will build your own load balancer that sends loads between AWS and Gcloud and make it the single point of failure instead?

steveBK123 7mo ago

I mean, given what we've seen of the impact of these AWS failures, wouldn't any enemy's first target be us-east-1? Imagine if it just disappeared.
