Now imagine for a bit that it will never come back up. See where that leads you. The internet got its main strengths from the fact that it was completely decentralized. We've been systematically eroding that strength.
It is rare, but it happens at LEAST 2-3x a year. AWS us-east-1 has a major incident affecting multiple services (which in turn affect most downstream AWS services) multiple times a year, and it's rarely the same root cause twice.
Not very many people realize that there are some services that still run only in us-east-1.
The only ones that you're likely to encounter are IAM, Route53, and the billing console. A billing console outage of a few hours is hardly a problem. IAM and Route53 are statically stable and designed to be mostly stand-alone. They are working fine right now, btw.
During this outage, my infrastructure on AWS is working just fine, simply because it's outside of us-east-1.
Ironically, our observability provider went down.
What are those?
It’s not hard to imagine events that would keep AWS dark for a long period of time, especially if you’re just in one region. The outage today was in us-east-1. Companies affected by it might want to consider at least geographic diversity, if not supplier diversity.
Sure, it's worth considering, but for most companies it's not going to be worth the engineering effort to architect cross-cloud services. The complexity is NOT linear.
IMO most shops should focus on testing backups (which should be at least cross-cloud, potentially on-prem of some sort) to make sure their data integrity is solid. Your data can't be recreated, everything else can be rebuilt even if it takes a long time.
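To make that concrete, here's a rough sketch of the kind of integrity check I mean, assuming boto3 and credentials are already set up; the bucket name and the on-prem mirror path are made up:

    # Sketch: verify that an off-cloud copy of each S3 backup object exists
    # and matches byte-for-byte. Bucket and mirror path are hypothetical.
    import hashlib
    from pathlib import Path

    import boto3

    BUCKET = "example-backup-bucket"           # hypothetical
    LOCAL_COPY = Path("/mnt/offsite-backups")  # hypothetical on-prem mirror

    s3 = boto3.client("s3")

    def sha256_of(path: Path) -> str:
        h = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET):
        for obj in page.get("Contents", []):
            local = LOCAL_COPY / obj["Key"]
            if not local.exists():
                print(f"MISSING off-cloud copy: {obj['Key']}")
                continue
            # Re-download and hash; for large backups you'd compare against
            # a stored manifest instead of pulling every object again.
            body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
            if hashlib.sha256(body).hexdigest() != sha256_of(local):
                print(f"MISMATCH: {obj['Key']}")

However you do it, the point is that the check runs somewhere outside AWS and someone actually looks at the output.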
Absurd claim.
Multi-region and multi-cloud are strategies that have existed almost as long as AWS. If your company is in anything finance-adjacent or critical infrastructure then you need to be able to cope with a single region of AWS failing.
Of course there are cases where multi-cloud makes sense, but they are the minority. The absurd claim is that most companies should worry about cloud outages and plan for AWS going offline forever.
GP said:
> most companies
Most companies aren't finance-adjacent or critical infrastructure
That still fits in with "almost guarantee". It's not as though it's true for everyone, e.g. people who might trigger DR after 10 minutes of downtime, and have it up and running within 30 more minutes.
But it is true for almost everyone: most people will trigger it after 30 minutes or more, and that, plus the time to execute DR, is often not going to be much less than the AWS resolution time.
Best of all would be just building multi-everything services from the start, where us-east-1 is just another node, but that's expensive and tricky with state.
This describes, what, under 1% of companies out there?
For most companies the cost of being multi-region is much more than the cost of just accepting the occasional outage.
One of my projects is entirely hosted on S3. I don't care enough if it becomes unavailable for a few hours to justify paying to distribute it to GCP et al.
And actually for most companies, the cost of multi-cloud is greater than the benefits. Particularly when those larger entities can just bitch to their AWS account manager to get a few grand refunded as credits.
What about if your account gets deleted? Or compromised and all your instances/services deleted?
I think the idea is to be able to have things continue running on not-AWS.
"Permanent AWS outage" includes someone pressing the wrong button in the AWS console and deleting something important or things like a hack or ransomware attack corrupting your data, as well as your account being banned or whatever. While it does include AWS itself going down in a big way, it's extremely unlikely that it won't come back, but if you cover other possibilities, that will probably be covered too.
But thinking the AWS SLA is guaranteed forever, and that everyone should put all their eggs in it because "everyone does it", is neither wise nor safe. Those who can afford it, and there are many businesses like that out there, should have a plan B. And actually, AWS should not necessarily be plan A.
Nothing is forever. Not the Roman empire, not the Inca empire, not China's dynasties, not US geopolitical supremacy. It's not a question of if but when. It doesn't need to come through a lot of suffering, but if we don't systematically organise for a humanity that spreads well-being for everyone in a systemically resilient way, we will get there through far more tragic consequences when this or that single point of failure finally falls.
Hosting your services on AWS while having a status page on AWS during an AWS outage is an easily avoidable problem.
Step 2 is multi-AZ
Step 3 is multi-region
Step 4 is multi-cloud.
Each company can work on its next step, but most will not have positive EROI going from 2 to 3+.
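To be fair, step 2 is often just a flag on the managed services you already run. A minimal sketch with boto3; the instance identifier is made up:

    # Sketch: enable Multi-AZ on an existing RDS instance so a standby in
    # another AZ can take over on failure. The identifier is hypothetical.
    import boto3

    rds = boto3.client("rds", region_name="us-east-1")

    rds.modify_db_instance(
        DBInstanceIdentifier="example-db",  # hypothetical
        MultiAZ=True,
        ApplyImmediately=False,  # apply during the next maintenance window
    )

Steps 3 and 4 are where the real cost shows up, because state has to be replicated and the failover actually rehearsed.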
If your resilience plan is to trust a third party, that means you don't really care about going down, doesn't it?
Besides that, as the above poster said, the issue with top tier cloud providers (or cloudflare, or google, etc) is not just that you rely on them, it is that enough people rely on them that you may suffer even if you don't.
Lesson here is that your approach will depend on your industry and peers. Every market will have its own philosophy and requirements here.
AWS US-East 1 has many outages. Anything significant should account for that.
An old mini PC running a home media server and a Minecraft server will run flawlessly forever. Put something of any business importance on it and it will fail tomorrow.
Related, I’m sure, is the fact that things like furnaces and water heaters will die on holidays.
I presume this means you must not be working for a company running anything at scale on AWS.
Not only that, but as you're seeing with this and the last few dozen outages... when us-east-1 goes down, a solid chunk of what many consumers consider the "internet" goes down. It's perceived less as "app C is down" and more is "the internet is broken today".
Oh god, this. At my company, we recently found a bug with rds.describe_events, which we needed in order to read binlog information after a B/G cutover. The bug, which AWS support “could not see the details of,” was that events would non-deterministically fail to show up if you filtered by instance name. Their recommended fix was to pull in all events for the past N minutes and do client-side filtering.
This was on top of the other bug I had found earlier, which was that despite the docs stating that you can use a B/G as a filter - a logical choice when querying for information directly related to the B/G you just cut over - doing so returns an empty set. Also, you can’t use a cluster (again, despite docs stating otherwise), you have to use the new cluster’s writer instance.
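For anyone hitting the same thing, this is roughly what their suggested workaround looks like with boto3; the instance name and the message filter here are made up for illustration:

    # Sketch of the workaround: skip server-side filtering by instance name,
    # pull all recent events, and filter client-side. Names are hypothetical.
    import boto3

    rds = boto3.client("rds", region_name="us-east-1")
    TARGET_INSTANCE = "example-writer-instance"  # hypothetical

    events = []
    paginator = rds.get_paginator("describe_events")
    # Duration is in minutes; grab everything from the last 30 minutes.
    for page in paginator.paginate(SourceType="db-instance", Duration=30):
        events.extend(page["Events"])

    # Client-side filtering, since filtering by SourceIdentifier was flaky.
    for e in events:
        if e["SourceIdentifier"] == TARGET_INSTANCE and "binlog" in e["Message"].lower():
            print(e["Date"], e["Message"])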
https://aws.amazon.com/blogs/industries/o2-telefonica-moves-...
A few hours could be a problem.
Not to mention it creates a valuable single point of failure for a hostile attack.
You know that’s not true; us-east-1’s last one was 2 years ago. But other services have bad days, and foundational ones drag others along.
At this point, being in any other region cuts your disaster exposure dramatically
my business contingency plan for "AWS shuts down and never comes back up" is to go bankrupt
You might as well say the entire NY + DC metro loses power and "never comes back up". What is the plan around that? The person replying is correct: most companies do not have an actionable plan for AWS never coming back up.
I worked at a medium-large company and was responsible for reviewing the infrastructure BCP plan. It stated that AWS going down was a risk, and if it happens we wait for it to come back up. (In a lot more words than that).
I guess the reason people are not doing it is that it hasn't been demonstrated to be worth it, yet!
I've got to admit though, whenever I hear about having a backup plan I think of having an apples-to-apples copy elsewhere, which is probably not wise/viable anyway. Perhaps having just enough to reach out to the service's users/customers suffices.
Also, I must add I am heavily influenced by a comment by Adrian Cockroft on why going multi-cloud isn't worth it. He worked for AWS (at the time at least), so I should probably have reached for the salt shaker.
Resilient systems work autonomously and can synchronize - but don't need to synchronize.
* Git is resilient.
* Native E-Mail clients - with local storage enabled - are somewhat resilient.
* A local package repository is - somewhat resilient.
* A local file-sharing app (not Warp/Magic-Wormhole, which need a relay) is resilient if it uses only local WiFi or Bluetooth.
We're building weak infrastructure. A lot of stuff should work locally and only optionally use the internet. But the web, that's the fragile, centralized, weak point currently, and it seems to be what you're actually referring to.
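A toy sketch of that shape, local-first with opportunistic sync; the endpoint is deliberately made up:

    # Toy sketch: write to a local queue first, push to the network when it's
    # reachable, and keep working when it isn't. The endpoint is made up.
    import json
    import sqlite3
    import urllib.request
    from urllib.error import URLError

    db = sqlite3.connect("outbox.db")
    db.execute("CREATE TABLE IF NOT EXISTS outbox (id INTEGER PRIMARY KEY, payload TEXT)")

    def record(payload: dict) -> None:
        """Always succeeds locally, regardless of connectivity."""
        db.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(payload),))
        db.commit()

    def try_sync(endpoint: str = "https://example.invalid/ingest") -> None:
        """Best-effort: push queued items if the remote is reachable."""
        for row_id, payload in db.execute("SELECT id, payload FROM outbox").fetchall():
            req = urllib.request.Request(endpoint, data=payload.encode(),
                                         headers={"Content-Type": "application/json"})
            try:
                urllib.request.urlopen(req, timeout=5)
            except (URLError, OSError):
                return  # offline or remote down: keep the queue, retry later
            db.execute("DELETE FROM outbox WHERE id = ?", (row_id,))
            db.commit()

    record({"event": "note-created", "text": "works offline"})
    try_sync()

Git, local mail stores, and package mirrors all have this shape: the local copy is authoritative enough to keep working on its own.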
Maybe nitpicky, but I feel like it's important to distinguish between "the web" and "the internet".
The word "seems" is doing a lot of heavy lifting there.
https://www.bbc.com/news/technology-57707530
That's because people trust and hope blindly. They believe IT is for saving money? It isn't. They coupled their cash registers to an American cloud service. Customers couldn't even pay in cash.
It usually gets worse when no outages happen for some time, because that increases blind trust.
Now we have computers that shit themselves if DNS isn’t working, let alone LANs that can operate disconnected from the Internet as a whole.
And partially working, or indicating that it works (when it doesn't), is usually even worse.
Yes, the Internet has stayed stable.
The Web, as defined by a bunch of servers running complex software, probably much less so.
Just the fact that it must necessarily be more complex means that it has more failure modes...
And FWIW, "AWS is down"... only one region (out of 36) of AWS is down.
You can do the multi-region failover, though that's still possibly overkill for most.
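If anyone wants a picture of what that usually involves: a health check on the primary region plus PRIMARY/SECONDARY DNS records. A rough boto3 sketch, with the hosted zone ID, domain names, and endpoints all made up:

    # Sketch: Route 53 failover routing between two regions. Every identifier
    # here (zone, domain, endpoints) is hypothetical.
    import boto3

    r53 = boto3.client("route53")

    # Health check against the primary region's endpoint.
    hc = r53.create_health_check(
        CallerReference="failover-demo-1",  # must be unique per request
        HealthCheckConfig={
            "Type": "HTTPS",
            "FullyQualifiedDomainName": "primary.example.com",
            "Port": 443,
            "ResourcePath": "/healthz",
            "RequestInterval": 30,
            "FailureThreshold": 3,
        },
    )

    def failover_record(role, set_id, target, health_check_id=None):
        rec = {
            "Name": "app.example.com",
            "Type": "CNAME",
            "TTL": 60,
            "SetIdentifier": set_id,
            "Failover": role,  # "PRIMARY" or "SECONDARY"
            "ResourceRecords": [{"Value": target}],
        }
        if health_check_id:
            rec["HealthCheckId"] = health_check_id
        return {"Action": "UPSERT", "ResourceRecordSet": rec}

    r53.change_resource_record_sets(
        HostedZoneId="Z0000000EXAMPLE",  # hypothetical
        ChangeBatch={"Changes": [
            failover_record("PRIMARY", "use1", "primary.example.com",
                            hc["HealthCheck"]["Id"]),
            failover_record("SECONDARY", "usw2", "secondary.example.com"),
        ]},
    )

DNS failover only moves traffic, though; your data layer still has to exist in the second region for it to help.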
But a large enough number of "not too big to fail" companies becomes a too-big-to-fail event. Too many medium-sized companies have a total dependency on AWS or Google or Microsoft, if not several at the same time.
We live in an increasingly fragile society, one step closer to critical failure, because big tech is not regulated in the same way as other infrastructure.
Second, preparing for the disappearance of AWS is even sillier. The chance that it will happen is orders of magnitude smaller than the chance that the cost of preparing for such an event will kill your business.
Let me ask you: how do you prepare your website for the complete collapse of Western society? Will you be able to adapt your business model to a post-apocalyptic world where only the cockroaches are left?
How did we go from 'you could lose your AWS account' to 'complete collapse of western society'? Do websites even matter in that context?
> Second, preparing for the disappearance of AWS is even more silly.
What's silly is not thinking ahead.
That's the main topic that's been going through my mind lately, if you replace "my website" with "the Wikimedia movement".
We need a far better social, juridical, and technical architecture for resilience, as hostile agendas are on the rise at all levels against sourced, trackable, global, volunteer-driven community knowledge bases.
For small and medium-sized companies it's not easy to perform accurate due diligence.
For you as a Mexican the end result is the same: AWS went away. And considering there is already a list of countries that cannot use AWS, GitHub, and a bunch of other "essential" services, it's not hard to imagine that list growing in the future.
Decentralized in terms of many companies making up the internet. Yes, we've seen heavy consolidation: fewer than 10 companies now make up the bulk of the internet.
The problem here isn't caused by companies choosing one cloud provider over another. It's the economies of scale leading us to a few large companies in any sector.
Not companies; the protocols are decentralized, and at one point it was mostly non-companies. Anyone can hook up a computer and start serving requests, which was (and is) a radical concept. We've lost a lot, unfortunately.
We have put more and more services on fewer and fewer vendors. But that's the consolidation and cost point.
But trust me, 10 seconds after this outage is solved everybody will have forgotten about the possibility.
As long as the outages are rare enough and you automatically fail over to a different region, what's the problem?
Even connectivity has its points of failure. I've touched with my own hands fiber runs that, with a few quick snips from a wire cutter, could bring sizable portions of the Internet offline. Granted, that was a long time ago, so those points of failure may no longer exist.
Be it a company or a state, concentration of power that exceeds by a large margin what is needed for its purpose is always a sure way to spread corruption, create feedback loops around single points of failure, and buy everyone a ticket to some dystopian reality, with a level of certainty that beats anything an SLA will ever give us.
There is no reason to have such brittle infra.
Need to keep eyes peeled at all levels of the organization, as many of these enter through day-to-day work…
I've actually had that.
Given the current geopolitical circumstances, that's not a far fetched scenario. Especially for us-east-1; or anything in the D.C. metro area.