Now imagine for a bit that it will never come back up. See where that leads you. The internet got its main strengths from the fact that it was completely decentralized. We've been systematically eroding that strength.
It is rare, but it happens at LEAST 2-3x a year. AWS us-east-1 has a major incident affecting multiple services (which in turn affect most downstream AWS services) multiple times a year, and it's rarely the same root cause twice.
Not very many people realize that there are some services that still run only in us-east-1.
The only ones that you're likely to encounter are IAM, Route53, and the billing console. A billing console outage of a few hours is hardly a problem. IAM and Route53 are statically stable and designed to be mostly stand-alone. They are working fine right now, btw.
During this outage, my infrastructure on AWS is working just fine, simply because it's outside of us-east-1.
Ironically, our observability provider went down.
What are those?
It’s not hard to imagine events that would keep AWS dark for a long period of time, especially if you’re just in one region. The outage today was in us-east-1. Companies affected by it might want to consider at least geographic diversity, if not supplier diversity.
Sure, it's worth considering, but for most companies it's not going to be worth the engineering effort to architect cross-cloud services. The complexity is NOT linear.
IMO most shops should focus on testing backups (which should be at least cross-cloud, potentially on-prem of some sort) to make sure their data integrity is solid. Your data can't be recreated, everything else can be rebuilt even if it takes a long time.
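To make that concrete, here's a rough sketch of the kind of integrity check I mean, assuming boto3 and credentials are already set up; the bucket name and the on-prem mirror path are made up:

    # Sketch: verify that an off-cloud copy of each S3 backup object exists
    # and matches byte-for-byte. Bucket and mirror path are hypothetical.
    import hashlib
    from pathlib import Path

    import boto3

    BUCKET = "example-backup-bucket"           # hypothetical
    LOCAL_COPY = Path("/mnt/offsite-backups")  # hypothetical on-prem mirror

    s3 = boto3.client("s3")

    def sha256_of(path: Path) -> str:
        h = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET):
        for obj in page.get("Contents", []):
            local = LOCAL_COPY / obj["Key"]
            if not local.exists():
                print(f"MISSING off-cloud copy: {obj['Key']}")
                continue
            # Re-download and hash; for large backups you'd compare against
            # a stored manifest instead of pulling every object again.
            body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
            if hashlib.sha256(body).hexdigest() != sha256_of(local):
                print(f"MISMATCH: {obj['Key']}")

However you do it, the point is that the check runs somewhere outside AWS and someone actually looks at the output.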
Absurd claim.
Multi-region and multi-cloud are strategies that have existed almost as long as AWS. If your company is in anything finance-adjacent or critical infrastructure then you need to be able to cope with a single region of AWS failing.
Of course there are cases where multi-cloud makes sense, but they are the minority. The absurd claim is that most companies should worry about cloud outages and plan for AWS going offline forever.
GP said:
> most companies
Most companies aren't finance-adjacent or critical infrastructure
That still fits in with "almost guarantee". It's not as though it's true for everyone, e.g. people who might trigger DR after 10 minutes of downtime, and have it up and running within 30 more minutes.
But it is true for almost everyone: most people will trigger it after 30 minutes or more, and that, plus the time to execute DR, is often not going to be much less than the AWS resolution time.
Best of all would be just building multi-everything services from the start, where us-east-1 is just another node, but that's expensive and tricky with state.
This describes, what, under 1% of companies out there?
For most companies the cost of being multi-region is much more than the cost of just accepting the occasional outage.
One of my projects is entirely hosted on S3. I don't care enough if it becomes unavailable for a few hours to justify paying to distribute it to GCP et al.
And actually for most companies, the cost of multi-cloud is greater than the benefits. Particularly when those larger entities can just bitch to their AWS account manager to get a few grand refunded as credits.
What about if your account gets deleted? Or compromised and all your instances/services deleted?
I think the idea is to be able to have things continue running on not-AWS.
"Permanent AWS outage" includes someone pressing the wrong button in the AWS console and deleting something important or things like a hack or ransomware attack corrupting your data, as well as your account being banned or whatever. While it does include AWS itself going down in a big way, it's extremely unlikely that it won't come back, but if you cover other possibilities, that will probably be covered too.
But thinking the AWS SLA is guaranteed forever, and that everyone should put all their eggs in it because "everyone does it", is neither wise nor safe. Those who can afford it, and there are many businesses like that out there, should have a plan B. And actually, AWS should not necessarily be plan A.
Nothing is forever. Not the Roman empire, not the Inca empire, not China's dynasties, not US geopolitical supremacy. It's not a question of if but when. It doesn't need to come through a lot of suffering, but if we don't systematically organise for a humanity that spreads well-being for everyone in a systemically resilient way, we will get there through far more tragic consequences when this or that single point of failure finally falls.
Hosting your services on AWS while having a status page on AWS during an AWS outage is an easily avoidable problem.
Step 2 is multi-AZ
Step 3 is multi-region
Step 4 is multi-cloud.
Each company can work on its next step, but most will not have positive EROI going from 2 to 3+.
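To be fair, step 2 is often just a flag on the managed services you already run. A minimal sketch with boto3; the instance identifier is made up:

    # Sketch: enable Multi-AZ on an existing RDS instance so a standby in
    # another AZ can take over on failure. The identifier is hypothetical.
    import boto3

    rds = boto3.client("rds", region_name="us-east-1")

    rds.modify_db_instance(
        DBInstanceIdentifier="example-db",  # hypothetical
        MultiAZ=True,
        ApplyImmediately=False,  # apply during the next maintenance window
    )

Steps 3 and 4 are where the real cost shows up, because state has to be replicated and the failover actually rehearsed.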
If your resilience plan is to trust a third party, that means you don't really care about going down, doesn't it?
Besides that, as the above poster said, the issue with top tier cloud providers (or cloudflare, or google, etc) is not just that you rely on them, it is that enough people rely on them that you may suffer even if you don't.
Lesson here is that your approach will depend on your industry and peers. Every market will have its own philosophy and requirements here.
AWS US-East 1 has many outages. Anything significant should account for that.
An old mini PC running a home media server and a Minecraft server will run flawlessly forever. Put something of any business importance on it and it will fail tomorrow.
Related, I’m sure, is the fact that things like furnaces and water heaters will die on holidays.
I presume this means you must not be working for a company running anything at scale on AWS.
Not only that, but as you're seeing with this and the last few dozen outages... when us-east-1 goes down, a solid chunk of what many consumers consider the "internet" goes down. It's perceived less as "app C is down" and more is "the internet is broken today".
Oh god, this. At my company, we recently found a bug with rds.describe_events, which we needed in order to read binlog information after a B/G cutover. The bug, which AWS support “could not see the details of,” was that events would non-deterministically fail to show up if you filtered by instance name. Their recommended fix was to pull in all events for the past N minutes and do client-side filtering.
This was on top of the other bug I had found earlier, which was that despite the docs stating that you can use a B/G as a filter - a logical choice when querying for information directly related to the B/G you just cut over - doing so returns an empty set. Also, you can’t use a cluster (again, despite docs stating otherwise), you have to use the new cluster’s writer instance.
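For anyone hitting the same thing, this is roughly what their suggested workaround looks like with boto3; the instance name and the message filter here are made up for illustration:

    # Sketch of the workaround: skip server-side filtering by instance name,
    # pull all recent events, and filter client-side. Names are hypothetical.
    import boto3

    rds = boto3.client("rds", region_name="us-east-1")
    TARGET_INSTANCE = "example-writer-instance"  # hypothetical

    events = []
    paginator = rds.get_paginator("describe_events")
    # Duration is in minutes; grab everything from the last 30 minutes.
    for page in paginator.paginate(SourceType="db-instance", Duration=30):
        events.extend(page["Events"])

    # Client-side filtering, since filtering by SourceIdentifier was flaky.
    for e in events:
        if e["SourceIdentifier"] == TARGET_INSTANCE and "binlog" in e["Message"].lower():
            print(e["Date"], e["Message"])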
https://aws.amazon.com/blogs/industries/o2-telefonica-moves-...
A few hours could be a problem.
Not to mention it creates a valuable single point of failure for a hostile attack.
You know that’s not true; us-east-1’s last one was 2 years ago. But other services have bad days, and foundational ones drag others along.
At this point, being in any other region cuts your disaster exposure dramatically
my business contingency plan for "AWS shuts down and never comes back up" is to go bankrupt
You might as well say the entire NY + DC metro loses power and "never comes back up". What is the plan around that? The person replying is correct: most companies do not have an actionable plan for AWS never coming back up.
I worked at a medium-large company and was responsible for reviewing the infrastructure BCP plan. It stated that AWS going down was a risk, and if it happens we wait for it to come back up. (In a lot more words than that).
I guess the reason people are not doing it is that it hasn't been demonstrated to be worth it, yet!
I've got to admit though, whenever I hear about having a backup plan I think of having an apples-to-apples copy elsewhere, which is probably not wise/viable anyway. Perhaps having just enough to reach out to the service's users/customers suffices.
Also, I must add I am heavily influenced by a comment by Adrian Cockroft on why going multi-cloud isn't worth it. He worked for AWS (at the time at least), so I should probably have reached for the salt shaker.
Resilient systems work autonomously and can synchronize - but don't need to synchronize.
* Git is resilient.
* Native E-Mail clients - with local storage enabled - are somewhat resilient.
* A local package repository is - somewhat resilient.
* A local file-sharing app (not Warp/Magic-Wormhole, which need a relay) is resilient if it uses only local WiFi or Bluetooth.
We're building weak infrastructure. A lot of stuff should work locally and only optionally use the internet. But the web, that's the fragile, centralized, weak point currently, and it seems to be what you're actually referring to.
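A toy sketch of that shape, local-first with opportunistic sync; the endpoint is deliberately made up:

    # Toy sketch: write to a local queue first, push to the network when it's
    # reachable, and keep working when it isn't. The endpoint is made up.
    import json
    import sqlite3
    import urllib.request
    from urllib.error import URLError

    db = sqlite3.connect("outbox.db")
    db.execute("CREATE TABLE IF NOT EXISTS outbox (id INTEGER PRIMARY KEY, payload TEXT)")

    def record(payload: dict) -> None:
        """Always succeeds locally, regardless of connectivity."""
        db.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(payload),))
        db.commit()

    def try_sync(endpoint: str = "https://example.invalid/ingest") -> None:
        """Best-effort: push queued items if the remote is reachable."""
        for row_id, payload in db.execute("SELECT id, payload FROM outbox").fetchall():
            req = urllib.request.Request(endpoint, data=payload.encode(),
                                         headers={"Content-Type": "application/json"})
            try:
                urllib.request.urlopen(req, timeout=5)
            except (URLError, OSError):
                return  # offline or remote down: keep the queue, retry later
            db.execute("DELETE FROM outbox WHERE id = ?", (row_id,))
            db.commit()

    record({"event": "note-created", "text": "works offline"})
    try_sync()

Git, local mail stores, and package mirrors all have this shape: the local copy is authoritative enough to keep working on its own.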
Maybe nitpicky, but I feel like it's important to distinguish between "the web" and "the internet".
The word "seems" is doing a lot of heavy lifting there.
https://www.bbc.com/news/technology-57707530
That's because people trust and hope blindly. They believe IT is for saving money? It isn't. They coupled their cash registers to an American cloud service. Customers couldn't even pay in cash.
It usually gets worse when no outages happen for some time, because that increases blind trust.
Now we have computers that shit themselves if DNS isn’t working, let alone LANs that can operate disconnected from the Internet as a whole.
And partially working, or indicating that it works (when it doesn't), is usually even worse.
Yes, the Internet has stayed stable.
The Web, as defined by a bunch of servers running complex software, probably much less so.
Just the fact that it must necessarily be more complex means that it has more failure modes...
And FWIW, "AWS is down"... only one region (out of 36) of AWS is down.
You can do the multi-region failover, though that's still possibly overkill for most.
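If anyone wants a picture of what that usually involves: a health check on the primary region plus PRIMARY/SECONDARY DNS records. A rough boto3 sketch, with the hosted zone ID, domain names, and endpoints all made up:

    # Sketch: Route 53 failover routing between two regions. Every identifier
    # here (zone, domain, endpoints) is hypothetical.
    import boto3

    r53 = boto3.client("route53")

    # Health check against the primary region's endpoint.
    hc = r53.create_health_check(
        CallerReference="failover-demo-1",  # must be unique per request
        HealthCheckConfig={
            "Type": "HTTPS",
            "FullyQualifiedDomainName": "primary.example.com",
            "Port": 443,
            "ResourcePath": "/healthz",
            "RequestInterval": 30,
            "FailureThreshold": 3,
        },
    )

    def failover_record(role, set_id, target, health_check_id=None):
        rec = {
            "Name": "app.example.com",
            "Type": "CNAME",
            "TTL": 60,
            "SetIdentifier": set_id,
            "Failover": role,  # "PRIMARY" or "SECONDARY"
            "ResourceRecords": [{"Value": target}],
        }
        if health_check_id:
            rec["HealthCheckId"] = health_check_id
        return {"Action": "UPSERT", "ResourceRecordSet": rec}

    r53.change_resource_record_sets(
        HostedZoneId="Z0000000EXAMPLE",  # hypothetical
        ChangeBatch={"Changes": [
            failover_record("PRIMARY", "use1", "primary.example.com",
                            hc["HealthCheck"]["Id"]),
            failover_record("SECONDARY", "usw2", "secondary.example.com"),
        ]},
    )

DNS failover only moves traffic, though; your data layer still has to exist in the second region for it to help.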
But a large enough number of "not too big to fail" companies becomes a too-big-to-fail event. Too many medium-sized companies have a total dependency on AWS or Google or Microsoft, if not several at the same time.
We live in an increasingly fragile society, one step closer to critical failure, because big tech is not regulated in the same way as other infrastructure.
Second, preparing for the disappearance of AWS is even sillier. The chance that it will happen is orders of magnitude smaller than the chance that the cost of preparing for such an event will kill your business.
Let me ask you: how do you prepare your website for the complete collapse of Western society? Will you be able to adapt your business model to a post-apocalyptic world where only the cockroaches are left?
How did we go from 'you could lose your AWS account' to 'complete collapse of western society'? Do websites even matter in that context?
> Second, preparing for the disappearance of AWS is even more silly.
What's silly is not thinking ahead.
That's the main topic that's been going through my mind lately, if you replace "my website" with "the Wikimedia movement".
We need a far better social, juridical, and technical architecture for resilience, as hostile agendas are on the rise at all levels against sourced, trackable, global, volunteer-driven community knowledge bases.
For small and medium-sized companies it's not easy to perform accurate due diligence.
For you as a Mexican the end result is the same: AWS went away. And considering there is already a list of countries that cannot use AWS, GitHub, and a bunch of other "essential" services, it's not hard to imagine that list growing in the future.
Decentralized in terms of many companies making up the internet. Yes, we've seen heavy consolidation: fewer than 10 companies now make up the bulk of the internet.
The problem here isn't caused by companies choosing one cloud provider over another. It's the economies of scale leading us to a few large companies in any sector.
Not companies; the protocols are decentralized, and at one point it was mostly non-companies. Anyone can hook up a computer and start serving requests, which was (and is) a radical concept. We've lost a lot, unfortunately.
We have put more and more services on fewer and fewer vendors. But that's the consolidation and cost point.
But trust me, 10 seconds after this outage is solved everybody will have forgotten about the possibility.
As long as the outages are rare enough and you automatically fail over to a different region, what's the problem?
Even connectivity has its points of failure. I've touched with my own hands fiber runs that, with a few quick snips from a wire cutter, could bring sizable portions of the Internet offline. Granted, that was a long time ago, so those points of failure may no longer exist.
Be it a company or a state, concentration of power that exceeds by a large margin what is needed for its purpose is always a sure way to spread corruption, create feedback loops around single points of failure, and buy everyone a ticket to some dystopian reality, with a level of certainty that beats anything an SLA will ever give us.
There is no reason to have such brittle infra.
Need to keep eyes peeled at all levels of the organization, as many of these enter through day-to-day work…
I've actually had that.
Given the current geopolitical circumstances, that's not a far fetched scenario. Especially for us-east-1; or anything in the D.C. metro area.