The biggest miss on our side is that, although we designed a multi-region-capable application, we could not run the failover process because our security org migrated us to Identity Center and only put it in us-east-1, hard-locking the entire company out of the AWS control plane. By the time we'd gotten the root credentials out of the vault, things were coming back up.
Good reminder that you are only as strong as your weakest link.
The AWS team keeps touting the rock solid reliability of AWS as a reason why we shouldn’t diversify our cloud. Should be a fun meeting!
It’s always DNS.
My control plane is native multi-region, so while it depends on many impacted services it stayed available. Each region runs in isolation. There is data replication at play but failing to replicate to us-east-1 had no impact on other regions.
The service itself is also native multi-region and has multiple layers where failover happens (DNS, routing, destination selection).
Nothing’s perfect and there are many ways this setup could fail. It’s just cool that it worked this time - great to see.
Nothing I’ve done is rocket science or expensive, but it does require doing things differently. Happy to answer questions about it.
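To make the DNS layer concrete: the shape of it is Route 53 failover records backed by health checks, so resolution flips to another region when the primary stops answering. A minimal sketch with hypothetical names and placeholder IDs (not my exact setup); note the record changes are control-plane calls, while the health-check-driven failover itself happens in the Route 53 data plane:

    # health check against the primary region's endpoint (hypothetical names)
    aws route53 create-health-check \
      --caller-reference use1-primary-check-1 \
      --health-check-config '{"Type":"HTTPS","FullyQualifiedDomainName":"use1.api.example.com","Port":443,"ResourcePath":"/healthz","RequestInterval":10,"FailureThreshold":3}'

    # PRIMARY answer, served while the health check passes
    aws route53 change-resource-record-sets --hosted-zone-id Z0EXAMPLE --change-batch '{
      "Changes":[{"Action":"UPSERT","ResourceRecordSet":{
        "Name":"api.example.com","Type":"CNAME","TTL":30,
        "SetIdentifier":"use1","Failover":"PRIMARY",
        "HealthCheckId":"<id returned above>",
        "ResourceRecords":[{"Value":"use1.api.example.com"}]}}]}'

    # SECONDARY answer, returned automatically when the primary is unhealthy
    aws route53 change-resource-record-sets --hosted-zone-id Z0EXAMPLE --change-batch '{
      "Changes":[{"Action":"UPSERT","ResourceRecordSet":{
        "Name":"api.example.com","Type":"CNAME","TTL":30,
        "SetIdentifier":"euw1","Failover":"SECONDARY",
        "ResourceRecords":[{"Value":"euw1.api.example.com"}]}}]}'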
Of course, such a large control plane system has all kinds of complex dependency chains. Auth/IAM seems like such a potentially (global) SPOF that you'd like to reduce dependencies to an absolute minimum. On the other hand, it's also the place that needs really good scalability, consistency, etc., so you probably want to use the battle-proven DB infrastructure you already have in place. Does that mean you will end up with a complex cyclic dependency that needs complex bootstrapping when it goes down? Or how is that handled?
https://www.bbc.com/news/live/c5y8k7k6v1rt?post=asset%3Ad902...
However, if you desperately need to access it you can force resolve it to 3.218.182.212. Seems to work for me. DNS through HN
curl -v --resolve "dynamodb.us-east-1.amazonaws.com:443:3.218.182.212" https://dynamodb.us-east-1.amazonaws.com/
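If you need SDKs and apps (not just curl) to pick up the pinned address, the same override can go into /etc/hosts. Temporary hack only, assuming that IP is still one of the endpoint's legitimate addresses; TLS still validates against the hostname, and remember to remove the line once DNS recovers:

    echo "3.218.182.212 dynamodb.us-east-1.amazonaws.com" | sudo tee -a /etc/hosts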
We're having fun figuring out how to communicate amongst colleagues now! It's only when it's gone that you realise your dependence.
Then things got worse. At 9:13 AM PT it sounds like they’re back to troubleshooting.
Honestly sounds like AWS doesn’t even really know what’s going on. Not good.
I think most sysadmins don't plan for an AWS outage. And economically that makes sense.
But it makes me wonder, is sysadmin a lost art?
It's still missing the one that earned me a phone call from a client.
Amazon is burning out and driving away the technical talent and knowledge, knowing the vendor lock-in will keep bringing in the sweet money. You will see more salespeople hovering around your C-suites and executives, while you face even worse technical support that doesn't seem to know what it's talking about, let alone able to fix the support issue you expected to be fixed easily.
Mark my words: if you are putting your eggs in one basket, that basket is now too complex and too interdependent, and the people who built it and knew those intricacies have been driven away by RTO and moves to hubs. Eventually the services that everything else (including other AWS services) heavily depends on might be more fragile than the public knows.
My website is down :(
(EDIT: website is back up, hooray)
> Oct 20 3:35 AM PDT
> The underlying DNS issue has been fully mitigated, and most AWS Service operations are succeeding normally now. Some requests may be throttled while we work toward full resolution. Additionally, some services are continuing to work through a backlog of events such as Cloudtrail and Lambda. While most operations are recovered, requests to launch new EC2 instances (or services that launch EC2 instances such as ECS) in the US-EAST-1 Region are still experiencing increased error rates. We continue to work toward full resolution. If you are still experiencing an issue resolving the DynamoDB service endpoints in US-EAST-1, we recommend flushing your DNS caches. We will provide an update by 4:15 AM, or sooner if we have additional information to share.
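For what it's worth, how you actually flush depends on what resolver sits in front of you; a couple of common cases (caches inside managed resolvers you don't control just have to age out on their own):

    # systemd-resolved (most current Linux distros)
    sudo resolvectl flush-caches

    # macOS
    sudo dscacheutil -flushcache && sudo killall -HUP mDNSResponder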
AWS doesn’t talk about that much publicly, but if you press them they will admit in private that there are some pretty nasty single points of failure in the design of AWS that can materialize if us-east-1 has an issue. Most people would say that means AWS isn’t truly multi-region in some areas.
Not entirely clear yet if those single points of failure were at play here, but risk mitigation isn’t as simple as just “don’t use us-east-1” or “deploy in multiple regions with load balancing failover.”
The costs, performance overhead, and complexity of a modern AWS deployment are insane and so out of line with what most companies should be taking on. But hype + microservices + sunk cost, and here we are.
Lots of orgs operating wholly in AWS, and sometimes only within us-east-1, had no operational problems last night. Some of that is design (not using the impacted services). Some of that is good resiliency in design. And some of that was dumb luck (accidentally good design).
Overall, the companies that did have operational problems likely wouldn't have invested in resiliency under any other deployment strategy either. It could have happened to them in Azure, GCP, or even a home-rolled datacenter.
Obviously, some services are only available in us-east-1, but many applications can gain some resiliency just by making a primary home in any other region.
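Even something as mundane as changing your tooling's default region is a start; a trivial sketch (region and profile name are just examples, and the caveat above about us-east-1-only services still applies):

    # make us-west-2 the default for this profile's CLI/SDK calls
    aws configure set region us-west-2 --profile myapp

    # and be explicit in scripts instead of relying on whatever the default happens to be
    aws dynamodb list-tables --region us-west-2 --profile myapp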
Now imagine for a bit that it will never come back up. See where that leads you. The internet got its main strengths from the fact that it was completely decentralized. We've been systematically eroding that strength.
I think we're doing the 21st century wrong.
Probably makes sense to add "relies on AWS" to the criteria we're using to evaluate 3rd-party services.
At least when us-east is down, everything is down.
Very big day for an engineering team indeed. Can't vibe code your way out of this issue...
One strange one was metrics capturing for Elasticache was dead for us (I assume Cloudwatch is the actual service responsible for this), so we were getting no data alerts in Datadog. Took a sec to hunt that down and realize everything was fine, we just don't have the metrics there.
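The quick sanity check we landed on was asking CloudWatch directly whether the datapoints exist at all, which separates "metrics pipeline is down" from "the cluster is actually sick". Something like this, with a made-up cluster id (GNU date shown; adjust on macOS):

    aws cloudwatch get-metric-statistics \
      --namespace AWS/ElastiCache --metric-name CPUUtilization \
      --dimensions Name=CacheClusterId,Value=my-redis-001 \
      --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
      --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
      --period 300 --statistics Average
    # empty Datapoints while the cluster is clearly serving traffic = pipeline problem, not a cluster problem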
I had minor protests against us-east-1 about 2.5 years ago, but it's a bit much to deal with now... Guess I should protest a bit louder next time.
This is why distributed systems is an extremely important discipline.
Seems to be really limited to us-east-1 (https://health.aws.amazon.com/health/status). I think they host a lot of console and backend stuff there.
There are very few businesses that genuinely cannot handle an outage like this. The only examples I've personally experienced are payment processing and semiconductor manufacturing. A severe IT outage in either of these businesses is an actual crisis. Contrast with the South Korean government who seems largely unaffected by the recent loss of an entire building full of machines with no backups.
I've worked in a retail store that had a total electricity outage and saw virtually no reduction in sales numbers for the day. I have seen a bank operate with a broken core system for weeks. I have never heard of someone actually cancelling a subscription over a transient outage in YouTube, Spotify, Netflix, Steam, etc.
The takeaway I always have from these events is that you should engineer your business to be resilient to the real tradeoff that AWS offers. If you don't overreact to the occasional outage and have reasonable measures to work around for a day or 2, it's almost certainly easier and cheaper than building a multi cloud complexity hellscape or dragging it all back on prem.
Thinking in terms of competition and game theory, you'll probably win even if your competitor has a perfect failover strategy. The cost of maintaining a flawless eject button for an entire cloud is like an anvil around your neck. Every IT decision has to be filtered through this axis. When you can just slap another EC2 on the pile, you can run laps around your peers.
Resolves to nothing.
Other AWS services should be able to survive this kind of interruption by rerouting requests to other AZs. Big company clients might also want to test against these kinds of scenarios.
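The standard building block for that is spreading the fleet across AZs behind a load balancer so health checks pull traffic away from a bad zone (worth noting this event was region-wide, so AZ spread alone wouldn't have covered it). A rough sketch with placeholder IDs:

    # ALB across subnets in two different AZs (placeholder IDs)
    aws elbv2 create-load-balancer --name app-alb \
      --subnets subnet-aaa111 subnet-bbb222

    # auto scaling group spanning the same AZs, registered with the ALB's target group
    aws autoscaling create-auto-scaling-group \
      --auto-scaling-group-name app-asg \
      --min-size 2 --max-size 6 --desired-capacity 2 \
      --vpc-zone-identifier "subnet-aaa111,subnet-bbb222" \
      --launch-template LaunchTemplateName=app-lt,Version='$Latest' \
      --target-group-arns "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/app-tg/0123456789abcdef"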
https://health.aws.amazon.com/health/status?path=service-his...
"I think if you look at what matters to customers, what they care they care a lot about what the operational performance is, you know, what the availability is, what the durability is, what the latency and throughput is of of the various services. And I think we have a pretty significant advantage in that area." also "And, yeah, you could just you just look at what's happened the last couple months. You can just see kind of adventures at some of these players almost every month. And so very big difference, I think, in security."
Perhaps for the internet as a whole, but for each individual service it underscores the risk of not hosting your service in multiple zones or having a backup
• Laying off top US engineering earners.
• Aggressively mandating RTO so the senior technical personnel would be pushed to leave.
• Other political ways ("Focus", "Below Expectations") to push engineering leadership (principal engineers, etc) to leave, without it counting as a layoff of course.
• Terminating highly skilled engineering contractors everywhere else.
• Migrating serious, complex workloads to entry-level employees in cheap office locations (India, Spain, etc).
This push was slow but mostly completed by Q1 this year. Correlation doesn't imply causation? I find that hard to believe in this case. AWS had outages before, but none like this "apparently nobody knows what to do" one.
Source: I was there.
Booting builder
/usr/bin/docker buildx inspect --bootstrap --builder builder-1c223ad9-e21b-41c7-a28e-69eea59c8dac
#1 [internal] booting buildkit
#1 pulling image moby/buildkit:buildx-stable-1
#1 pulling image moby/buildkit:buildx-stable-1 9.6s done
#1 ERROR: received unexpected HTTP status: 500 Internal Server Error
------
 > [internal] booting buildkit:
------
ERROR: received unexpected HTTP status: 500 Internal Server Error
Other hosting services like Vercel, package managers like npm, and even the Docker registries are down because of it.
Weird that case creation runs in the same region as the one you're trying to open a case about.
Maybe this is the event to get everyone off of piling everything onto us-east-1 and hoping for the best, but the last few outages didn’t, so I don’t expect this one to, either.
Modern companies live life on the edge. Just in time, no resilience, no flexibility. We see the disaster this causes whenever something unexpected happens - the Ever Given blocking the Suez, for example, let alone something like Covid.
However increasingly what should be minor loss of resilience, like an AWS outage or a Crowdstrike incident, turns into major failures.
This fragility is something government needs to legislate to prevent. When one supermarket is out that's fine - people can go elsewhere, the damage is contained. When all fail, that's a major problem.
On top of that, the attitude across the entire sector is bad. People think it's fine for IT to fail once or twice a year and that it's not a problem. If that attitude reaches truly important systems it will lead to major civil problems. Any civilisation is three good meals away from anarchy.
There's no profit motive to avoid this, companies don't care about being offline for the day, as long as all their mates are also offline.
AWS makes their SLAs & uptime rates very clear, along with explicit warnings about building failover / business continuity.
Most of the questions on the AWS CSA exam are related to resiliency.
Look, we've all gone the lazy route and done this before. As usual, the problem exists between the keyboard and the chair.
[1] https://bitbucket.status.atlassian.com/incidents/p20f40pt1rg...
Humans have built-in redundancy for a reason.
Always DNS..
Came here after the Internet felt oddly "ill" and even got issues using Medium, and sure enough https://status.medium.com
It's ridiculous how everything is being stored in the cloud, even simple timers. It's past time to move functionality back on-device, which would also come with the advantage of making it easier to disconnect from big tech's capitalist surveillance state.
No landing page explaining services are down, just scary error pages. I thought account was compromised. Thanks HN for, as always, being the first to clarify what's happening.
Scary to see that in order to order from Amazon Germany, us-east-1 must be up. Everything else works flawlessly, but payments are a no-go.
(The counter-joke is, of course, "but that's `us-east-1` already!" But I mean deliberately and frequently.)
Ah yes, the great AWS us-east-1 outage.
Half the internet’s on fire, engineers haven’t slept in 18 hours, and every self-styled “resilience thought leader” is already posting:
“This is why you need multi-cloud, powered by our patented observability synergy platform™.”
Shut up, Greg.
Your SaaS product doesn’t fix DNS, you're simply adding another dashboard to watch the world burn in higher definition.
If your first reaction to a widespread outage is “time to drive engagement,” you're working in tragedy tourism. Bet your kids are super proud.
Meanwhile, the real heroes are the SREs duct-taping Route 53 with pure caffeine and spite.
https://www.linkedin.com/posts/coquinn_aws-useast1-cloudcomp...
> 5000 Reddit users reported a certain number of problems shortly after a specific time.
> 400000 A certain number of reports were made in the UK alone in two hours.
Even the error message itself is wrong whenever that one appears.
Even @ 9:30am ET this morning, after this supposedly was clearing up, my doctor's office's practice management software was still hosed. Quite the long tail here.
Lost data, revenue, etc.
I'm not talking about AWS but whoever's downstream.
Is it like 100M, like 1B?
Appears to have happened within the last 10-15 minutes.
I think I might be ready to build out a replacement through vibe coding. I don’t like being dependent on user submissions though. I feel like that’s a challenge on its own.
You're gonna hear mostly complaints in this thread, but simple, resilient, single-region architecture is still reliable as hell in AWS, even in the worst region.
Edit: I can login into one of the AWS accounts (I have a few different ones for different companies), but my personal which has a ".edu" email is not logging in.
My refusal to hoard every asset into AWS (let alone put anything of import in us-east-1) has saved me repeatedly in the past. Diversity is the foundation of resiliency, after all.
It's always DNS...
https://www.nytimes.com/2025/05/25/business/amazon-ai-coders...
"Pushed to use artificial intelligence, software developers at the e-commerce giant say they must work faster and have less time to think."
Every bit of thinking time spent on a dysfunctional, lying "AI" agent could be spent on understanding the system. Even if you don't move your mouse all the time in order to please a dumb middle manager.
Impacting all banking services with a red status error. Oddly enough, only their direct deposits are functioning without issues.
A lot of businesses have all their workflows depending on their data on airtable.
Signal was also down.
It's not difficult, it's just that we engineers chose convenience and delegated uptime to someone else.
[1] - https://usetrmnl.com
Not just AWS, but Cloudflare and others too. Would be interesting to review them clinically.
"upstream connect error or disconnect/reset before headers. retried and the latest reset reason: connection timeout"
I think no matter how hard you try to avoid it, in the end there's always a massive dependency chain for modern digital infrastructure[1].
[1]: https://itsfoss.community/uploads/default/optimized/2X/a/ad3...
There aren't any communities on Reddit with that name. Double-check the community name or start a new community.
"It's been on the dev teams list for a while"
"Welp....."
That means Cursor is down, can't login.
Ladies and gentlemen, it's about time we learn about reshoring in the IT world as well. Owning nothing and renting everything means extreme fragility.
So your complaints matter nothing because "number go up".
I remember the good old days of everyone starting a hosting company. We never should have left.
How the hell did Ring/Amazon not include a radio-frequency transmitter for the doorbell and chime? This is absurd.
To top it off, I'm trying to do my quarterly VAT return, and Xero is still completely borked, nearly 20 hours after the initial outage.
"But you can't do webscale uptime on your own"
Sure. I suspect even a single pi with auto-updates on has less downtime.
> “The Machine,” they exclaimed, “feeds us and clothes us and houses us; through it we speak to one another, through it we see one another, in it we have our being. The Machine is the friend of ideas and the enemy of superstition: the Machine is omnipotent, eternal; blessed is the Machine.”
..
> "she spoke with some petulance to the Committee of the Mending Apparatus. They replied, as before, that the defect would be set right shortly. “Shortly! At once!” she retorted"
..
> "there came a day when, without the slightest warning, without any previous hint of feebleness, the entire communication-system broke down, all over the world, and the world, as they understood it, ended."
- https://status.twilio.com/ - https://www.intercomstatus.com/us-hosting
I want the web ca. 2001 back, please.
(Useless service status pages are incredibly annoying)
"Do we enable DR? Yes/No". That's all you can do. If you do, it's a whole machinery starting, which might take longer than the outage itself.
They can't even use Slack to communicate - messages are being dropped/not sent.
And then we laugh at the South Koreans for not having backed up their hard drives (which were destroyed by an actual fire, a statistically far less likely event than an AWS outage). OK, that's a huge screw-up, but hey, this is not insignificant either.
What will happen now? Nothing, like nothing happened after Crowdstrike's bug last year.
I was under the impression that having multiple availability zones guarantees high availability.
It seems this is not the case.
Right now on levels.fyi, the highest-paying non-managerial engineering role is offered by Oracle. They might not pay the recent grads as well as Google or Microsoft, but they definitely value the principal engineers w/ 20 years of experience.
Economic efficiency and technical complexity are both, separately and together, enemies of resilience
It might be an interesting exercise to map how many of our services depend on us-east-1 in one way or another. One can only hope that somebody would do something with the intel, even though it's not a feature that brings money in (at least from business perspective).
And they give you a much better developer experience...
Sigh
FFS ...
Entire regions go down
Don't pay for intra-AZ traffic, friends.
I am the CEO of the company and started it because I wanted to give engineering teams an unbreakable cloud. You can mix-n-match services of ANY cloud provider, and workloads failover seamlessly across clouds/on-prem environments.
Feel free to get in touch!
Our applications and databases must have ultra-high availability. This can be achieved by hosting applications and data platforms in different regions for failover.
Critical businesses should also plan for replication across multiple cloud platforms. You may use some of the existing solutions out there that can help with such implementations for data platforms.
- Qlik Replicate
- HexaRocket
and some more.
Or rather implement native replication solutions available with data platforms.
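For the native-replication route on AWS specifically, DynamoDB global tables and cross-region RDS read replicas cover a lot of ground; a rough sketch with placeholder names:

    # add a replica region to an existing DynamoDB table (global tables)
    aws dynamodb update-table --table-name orders \
      --replica-updates '[{"Create":{"RegionName":"eu-west-1"}}]' \
      --region us-east-1

    # cross-region RDS read replica, created from the destination region using the source ARN
    aws rds create-db-instance-read-replica \
      --db-instance-identifier orders-replica \
      --source-db-instance-identifier arn:aws:rds:us-east-1:123456789012:db:orders-primary \
      --region eu-west-1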