The biggest miss on our side is that, although we designed a multi-region-capable application, we could not run the failover process because our security org migrated us to Identity Center and only put it in us-east-1, hard-locking the entire company out of the AWS control plane. By the time we'd gotten the root credentials out of the vault, things were coming back up.
Good reminder that you are only as strong as your weakest link.
The AWS team keeps touting the rock solid reliability of AWS as a reason why we shouldn’t diversify our cloud. Should be a fun meeting!
It’s always DNS.
My control plane is native multi-region, so while it depends on many impacted services it stayed available. Each region runs in isolation. There is data replication at play but failing to replicate to us-east-1 had no impact on other regions.
The service itself is also native multi-region and has multiple layers where failover happens (DNS, routing, destination selection).
Nothing’s perfect and there are many ways this setup could fail. It’s just cool that it worked this time - great to see.
Nothing I’ve done is rocket science or expensive, but it does require doing things differently. Happy to answer questions about it.
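To make the DNS layer concrete: the shape of it is Route 53 failover records backed by health checks, so resolution flips to another region when the primary stops answering. A minimal sketch with hypothetical names and placeholder IDs (not my exact setup); note the record changes are control-plane calls, while the health-check-driven failover itself happens in the Route 53 data plane:

    # health check against the primary region's endpoint (hypothetical names)
    aws route53 create-health-check \
      --caller-reference use1-primary-check-1 \
      --health-check-config '{"Type":"HTTPS","FullyQualifiedDomainName":"use1.api.example.com","Port":443,"ResourcePath":"/healthz","RequestInterval":10,"FailureThreshold":3}'

    # PRIMARY answer, served while the health check passes
    aws route53 change-resource-record-sets --hosted-zone-id Z0EXAMPLE --change-batch '{
      "Changes":[{"Action":"UPSERT","ResourceRecordSet":{
        "Name":"api.example.com","Type":"CNAME","TTL":30,
        "SetIdentifier":"use1","Failover":"PRIMARY",
        "HealthCheckId":"<id returned above>",
        "ResourceRecords":[{"Value":"use1.api.example.com"}]}}]}'

    # SECONDARY answer, returned automatically when the primary is unhealthy
    aws route53 change-resource-record-sets --hosted-zone-id Z0EXAMPLE --change-batch '{
      "Changes":[{"Action":"UPSERT","ResourceRecordSet":{
        "Name":"api.example.com","Type":"CNAME","TTL":30,
        "SetIdentifier":"euw1","Failover":"SECONDARY",
        "ResourceRecords":[{"Value":"euw1.api.example.com"}]}}]}'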
Of course, such a large control plane system has all kinds of complex dependency chains. Auth/IAM seems like such a potentially (global) SPOF that you'd like to reduce dependencies to an absolute minimum. On the other hand, it's also the place that needs really good scalability, consistency, etc., so you probably want to use the battle-proven DB infrastructure you already have in place. Does that mean you will end up with a complex cyclic dependency that needs complex bootstrapping when it goes down? Or how is that handled?
https://www.bbc.com/news/live/c5y8k7k6v1rt?post=asset%3Ad902...
However, if you desperately need to access it you can force resolve it to 3.218.182.212. Seems to work for me. DNS through HN
curl -v --resolve "dynamodb.us-east-1.amazonaws.com:443:3.218.182.212" https://dynamodb.us-east-1.amazonaws.com/
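If you need SDKs and apps (not just curl) to pick up the pinned address, the same override can go into /etc/hosts. Temporary hack only, assuming that IP is still one of the endpoint's legitimate addresses; TLS still validates against the hostname, and remember to remove the line once DNS recovers:

    echo "3.218.182.212 dynamodb.us-east-1.amazonaws.com" | sudo tee -a /etc/hosts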
We're having fun figuring out how to communicate amongst colleagues now! It's only when it's gone that you realise your dependence.
Then things got worse. At 9:13 AM PT it sounds like they’re back to troubleshooting.
Honestly sounds like AWS doesn’t even really know what’s going on. Not good.
I think most sysadmins don't plan for an AWS outage. And economically that makes sense.
But it makes me wonder, is sysadmin a lost art?
It's still missing the one that earned me a phone call from a client.
Amazon is burning out and driving away the technical talent and knowledge, knowing the vendor lock-in will keep bringing in the sweet money. You will see more salespeople hovering around your C-suites and executives, while you face even worse technical support that doesn't seem to know what it's talking about, let alone able to fix the support issue you expected to be fixed easily.
Mark my words: if you are putting your eggs in one basket, that basket is now too complex and too interdependent, and the people who built it and knew those intricacies have been driven away by RTO and moves to hubs. Eventually the services that everything else (including other AWS services) heavily depends on might be more fragile than the public knows.
My website is down :(
(EDIT: website is back up, hooray)
> Oct 20 3:35 AM PDT
> The underlying DNS issue has been fully mitigated, and most AWS Service operations are succeeding normally now. Some requests may be throttled while we work toward full resolution. Additionally, some services are continuing to work through a backlog of events such as Cloudtrail and Lambda. While most operations are recovered, requests to launch new EC2 instances (or services that launch EC2 instances such as ECS) in the US-EAST-1 Region are still experiencing increased error rates. We continue to work toward full resolution. If you are still experiencing an issue resolving the DynamoDB service endpoints in US-EAST-1, we recommend flushing your DNS caches. We will provide an update by 4:15 AM, or sooner if we have additional information to share.
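For what it's worth, how you actually flush depends on what resolver sits in front of you; a couple of common cases (caches inside managed resolvers you don't control just have to age out on their own):

    # systemd-resolved (most current Linux distros)
    sudo resolvectl flush-caches

    # macOS
    sudo dscacheutil -flushcache && sudo killall -HUP mDNSResponder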
AWS doesn’t talk about that much publicly, but if you press them they will admit in private that there are some pretty nasty single points of failure in the design of AWS that can materialize if us-east-1 has an issue. Most people would say that means AWS isn’t truly multi-region in some areas.
Not entirely clear yet if those single points of failure were at play here, but risk mitigation isn’t as simple as just “don’t use us-east-1” or “deploy in multiple regions with load balancing failover.”
The costs, performance overhead, and complexity of a modern AWS deployment are insane and so out of line with what most companies should be taking on. But hype + microservices + sunk cost, and here we are.
Lots of orgs operating wholly in AWS, and sometimes only within us-east-1, had no operational problems last night. Some of that is design (not using the impacted services). Some of that is good resiliency in design. And some of that was dumb luck (accidentally good design).
Overall, the companies that did have operational problems likely wouldn't have invested in resiliency under any other deployment strategy either. It could have happened to them in Azure, GCP, or even a home-rolled datacenter.
Obviously, some services are only available in us-east-1, but many applications can gain some resiliency just by making a primary home in any other region.
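Even something as mundane as changing your tooling's default region is a start; a trivial sketch (region and profile name are just examples, and the caveat above about us-east-1-only services still applies):

    # make us-west-2 the default for this profile's CLI/SDK calls
    aws configure set region us-west-2 --profile myapp

    # and be explicit in scripts instead of relying on whatever the default happens to be
    aws dynamodb list-tables --region us-west-2 --profile myapp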
Now imagine for a bit that it will never come back up. See where that leads you. The internet got its main strengths from the fact that it was completely decentralized. We've been systematically eroding that strength.
I think we're doing the 21st century wrong.
Probably makes sense to add "relies on AWS" to the criteria we're using to evaluate 3rd-party services.
At least when us-east is down, everything is down.
Very big day for an engineering team indeed. Can't vibe code your way out of this issue...
One strange one was metrics capturing for Elasticache was dead for us (I assume Cloudwatch is the actual service responsible for this), so we were getting no data alerts in Datadog. Took a sec to hunt that down and realize everything was fine, we just don't have the metrics there.
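The quick sanity check we landed on was asking CloudWatch directly whether the datapoints exist at all, which separates "metrics pipeline is down" from "the cluster is actually sick". Something like this, with a made-up cluster id (GNU date shown; adjust on macOS):

    aws cloudwatch get-metric-statistics \
      --namespace AWS/ElastiCache --metric-name CPUUtilization \
      --dimensions Name=CacheClusterId,Value=my-redis-001 \
      --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
      --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
      --period 300 --statistics Average
    # empty Datapoints while the cluster is clearly serving traffic = pipeline problem, not a cluster problem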
I had minor protests against us-east-1 about 2.5 years ago, but it's a bit much to deal with now... Guess I should protest a bit louder next time.
This is why distributed systems is an extremely important discipline.
Seems to be really limited to us-east-1 (https://health.aws.amazon.com/health/status). I think they host a lot of console and backend stuff there.
There are very few businesses that genuinely cannot handle an outage like this. The only examples I've personally experienced are payment processing and semiconductor manufacturing. A severe IT outage in either of these businesses is an actual crisis. Contrast with the South Korean government who seems largely unaffected by the recent loss of an entire building full of machines with no backups.
I've worked in a retail store that had a total electricity outage and saw virtually no reduction in sales numbers for the day. I have seen a bank operate with a broken core system for weeks. I have never heard of someone actually cancelling a subscription over a transient outage in YouTube, Spotify, Netflix, Steam, etc.
The takeaway I always have from these events is that you should engineer your business to be resilient to the real tradeoff that AWS offers. If you don't overreact to the occasional outage and have reasonable measures to work around for a day or 2, it's almost certainly easier and cheaper than building a multi cloud complexity hellscape or dragging it all back on prem.
Thinking in terms of competition and game theory, you'll probably win even if your competitor has a perfect failover strategy. The cost of maintaining a flawless eject button for an entire cloud is like an anvil around your neck. Every IT decision has to be filtered through this axis. When you can just slap another EC2 on the pile, you can run laps around your peers.
Resolves to nothing.
Other AWS services should be able to survive this kind of interruption by rerouting requests to other AZs. Big company clients might also want to test against these kinds of scenarios.
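The standard building block for that is spreading the fleet across AZs behind a load balancer so health checks pull traffic away from a bad zone (worth noting this event was region-wide, so AZ spread alone wouldn't have covered it). A rough sketch with placeholder IDs:

    # ALB across subnets in two different AZs (placeholder IDs)
    aws elbv2 create-load-balancer --name app-alb \
      --subnets subnet-aaa111 subnet-bbb222

    # auto scaling group spanning the same AZs, registered with the ALB's target group
    aws autoscaling create-auto-scaling-group \
      --auto-scaling-group-name app-asg \
      --min-size 2 --max-size 6 --desired-capacity 2 \
      --vpc-zone-identifier "subnet-aaa111,subnet-bbb222" \
      --launch-template LaunchTemplateName=app-lt,Version='$Latest' \
      --target-group-arns "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/app-tg/0123456789abcdef"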
https://health.aws.amazon.com/health/status?path=service-his...
"I think if you look at what matters to customers, what they care they care a lot about what the operational performance is, you know, what the availability is, what the durability is, what the latency and throughput is of of the various services. And I think we have a pretty significant advantage in that area." also "And, yeah, you could just you just look at what's happened the last couple months. You can just see kind of adventures at some of these players almost every month. And so very big difference, I think, in security."
Perhaps for the internet as a whole, but for each individual service it underscores the risk of not hosting your service in multiple zones or having a backup
• Laying off top US engineering earners.
• Aggressively mandating RTO so the senior technical personnel would be pushed to leave.
• Other political ways ("Focus", "Below Expectations") to push engineering leadership (principal engineers, etc) to leave, without it counting as a layoff of course.
• Terminating highly skilled engineering contractors everywhere else.
• Migrating serious, complex workloads to entry-level employees in cheap office locations (India, Spain, etc).
This push was slow but mostly completed by Q1 this year. Correlation doesn't imply causation? I find that hard to believe in this case. AWS had outages before, but none like this "apparently nobody knows what to do" one.
Source: I was there.
Booting builder
/usr/bin/docker buildx inspect --bootstrap --builder builder-1c223ad9-e21b-41c7-a28e-69eea59c8dac
#1 [internal] booting buildkit
#1 pulling image moby/buildkit:buildx-stable-1
#1 pulling image moby/buildkit:buildx-stable-1 9.6s done
#1 ERROR: received unexpected HTTP status: 500 Internal Server Error
------
 > [internal] booting buildkit:
------
ERROR: received unexpected HTTP status: 500 Internal Server Error
Other hosting services like Vercel, package managers like npm, and even the Docker registries are down because of it.
Weird that case creation runs in the same region as the one you're trying to open a case about.
Maybe this is the event to get everyone off of piling everything onto us-east-1 and hoping for the best, but the last few outages didn’t, so I don’t expect this one to, either.
Modern companies live life on the edge. Just in time, no resilience, no flexibility. We see the disaster this causes whenever something unexpected happens - the Ever Given blocking the Suez, for example, let alone something like Covid.
However increasingly what should be minor loss of resilience, like an AWS outage or a Crowdstrike incident, turns into major failures.
This fragility is something government needs to legislate to prevent. When one supermarket is out that's fine - people can go elsewhere, the damage is contained. When all fail, that's a major problem.
On top of that, the attitude across the entire sector is bad. People think it's fine for IT to fail once or twice a year and that it's not a problem. If that attitude reaches truly important systems it will lead to major civil problems. Any civilisation is three good meals away from anarchy.
There's no profit motive to avoid this, companies don't care about being offline for the day, as long as all their mates are also offline.
AWS makes their SLAs & uptime rates very clear, along with explicit warnings about building failover / business continuity.
Most of the questions on the AWS CSA exam are related to resiliency.
Look, we've all gone the lazy route and done this before. As usual, the problem exists between the keyboard and the chair.
[1] https://bitbucket.status.atlassian.com/incidents/p20f40pt1rg...
Humans have built-in redundancy for a reason.
Always DNS..
Came here after the Internet felt oddly "ill" and even got issues using Medium, and sure enough https://status.medium.com
It's ridiculous how everything is being stored in the cloud, even simple timers. It's past time to move functionality back on-device, which would also come with the advantage of making it easier to disconnect from big tech's capitalist surveillance state.
No landing page explaining services are down, just scary error pages. I thought account was compromised. Thanks HN for, as always, being the first to clarify what's happening.
Scary to see that in order to order from Amazon Germany, us-east-1 must be up. Everything else works flawlessly, but payments are a no-go.
(The counter-joke is, of course, "but that's `us-east-1` already!" But I mean deliberately and frequently.)
Ah yes, the great AWS us-east-1 outage.
Half the internet’s on fire, engineers haven’t slept in 18 hours, and every self-styled “resilience thought leader” is already posting:
“This is why you need multi-cloud, powered by our patented observability synergy platform™.”
Shut up, Greg.
Your SaaS product doesn’t fix DNS, you're simply adding another dashboard to watch the world burn in higher definition.
If your first reaction to a widespread outage is “time to drive engagement,” you're working in tragedy tourism. Bet your kids are super proud.
Meanwhile, the real heroes are the SREs duct-taping Route 53 with pure caffeine and spite.
https://www.linkedin.com/posts/coquinn_aws-useast1-cloudcomp...
> 5000 Reddit users reported a certain number of problems shortly after a specific time.
> 400000 A certain number of reports were made in the UK alone in two hours.
Even the error message itself is wrong whenever that one appears.
Even @ 9:30am ET this morning, after this supposedly was clearing up, my doctor's office's practice management software was still hosed. Quite the long tail here.
Lost data, revenue, etc.
I'm not talking about AWS but whoever's downstream.
Is it like 100M, like 1B?
Appears to have happened within the last 10-15 minutes.
I think I might be ready to build out a replacement through vibe coding. I don’t like being dependent on user submissions though. I feel like that’s a challenge on its own.
You're gonna hear mostly complaints in this thread, but simple, resilient, single-region architecture is still reliable as hell in AWS, even in the worst region.
Edit: I can login into one of the AWS accounts (I have a few different ones for different companies), but my personal which has a ".edu" email is not logging in.
My refusal to hoard every asset into AWS (let alone put anything of import in us-east-1) has saved me repeatedly in the past. Diversity is the foundation of resiliency, after all.
It's always DNS...
https://www.nytimes.com/2025/05/25/business/amazon-ai-coders...
"Pushed to use artificial intelligence, software developers at the e-commerce giant say they must work faster and have less time to think."
Every bit of thinking time spent on a dysfunctional, lying "AI" agent could be spent on understanding the system. Even if you don't move your mouse all the time in order to please a dumb middle manager.
Impacting all banking services with a red status error. Oddly enough, only their direct deposits are functioning without issues.
A lot of businesses have all their workflows depending on their data on airtable.
Signal was also down.
It's not difficult, it's just that we engineers chose convenience and delegated uptime to someone else.
[1] - https://usetrmnl.com
Not just AWS, but Cloudflare and others too. Would be interesting to review them clinically.
"upstream connect error or disconnect/reset before headers. retried and the latest reset reason: connection timeout"
I think no matter how hard you try to avoid it, in the end there's always a massive dependency chain for modern digital infrastructure[1].
[1]: https://itsfoss.community/uploads/default/optimized/2X/a/ad3...
There aren't any communities on Reddit with that name. Double-check the community name or start a new community.
"It's been on the dev teams list for a while"
"Welp....."
That means Cursor is down, can't login.
Ladies and gentlemen, it's about time we learn about reshoring in the IT world as well. Owning nothing and renting everything means extreme fragility.
So your complaints matter nothing because "number go up".
I remember the good old days of everyone starting a hosting company. We never should have left.
How the hell did Ring/Amazon not include a radio-frequency transmitter for the doorbell and chime? This is absurd.
To top it off, I'm trying to do my quarterly VAT return, and Xero is still completely borked, nearly 20 hours after the initial outage.
"But you can't do webscale uptime on your own"
Sure. I suspect even a single pi with auto-updates on has less downtime.
> “The Machine,” they exclaimed, “feeds us and clothes us and houses us; through it we speak to one another, through it we see one another, in it we have our being. The Machine is the friend of ideas and the enemy of superstition: the Machine is omnipotent, eternal; blessed is the Machine.”
..
> "she spoke with some petulance to the Committee of the Mending Apparatus. They replied, as before, that the defect would be set right shortly. “Shortly! At once!” she retorted"
..
> "there came a day when, without the slightest warning, without any previous hint of feebleness, the entire communication-system broke down, all over the world, and the world, as they understood it, ended."
- https://status.twilio.com/ - https://www.intercomstatus.com/us-hosting
I want the web ca. 2001 back, please.
(Useless service status pages are incredibly annoying)
"Do we enable DR? Yes/No". That's all you can do. If you do, it's a whole machinery starting, which might take longer than the outage itself.
They can't even use Slack to communicate - messages are being dropped/not sent.
And then we laugh at the South Koreans for not having backed up their hard drives (which were destroyed by an actual fire, a statistically far less likely event than an AWS outage). OK, that's a huge screw-up, but hey, this is not insignificant either.
What will happen now? Nothing, like nothing happened after Crowdstrike's bug last year.
I was under the impression that having multiple availability zones guarantees high availability.
It seems this is not the case.
Right now on levels.fyi, the highest-paying non-managerial engineering role is offered by Oracle. They might not pay the recent grads as well as Google or Microsoft, but they definitely value the principal engineers w/ 20 years of experience.
Economic efficiency and technical complexity are both, separately and together, enemies of resilience
It might be an interesting exercise to map how many of our services depend on us-east-1 in one way or another. One can only hope that somebody would do something with the intel, even though it's not a feature that brings money in (at least from business perspective).
And they give you a much better developer experience...
Sigh
FFS ...
Entire regions go down
Don't pay for intra-AZ traffic, friends.
I am the CEO of the company and started it because I wanted to give engineering teams an unbreakable cloud. You can mix-n-match services of ANY cloud provider, and workloads failover seamlessly across clouds/on-prem environments.
Feel free to get in touch!
Our applications and databases must have ultra-high availability. This can be achieved by hosting applications and data platforms in different regions for failover.
Critical businesses should also plan for replication across multiple cloud platforms. You may use some of the existing solutions out there that can help with such implementations for data platforms.
- Qlik Replicate
- HexaRocket
and some more.
Or rather implement native replication solutions available with data platforms.
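For the native-replication route on AWS specifically, DynamoDB global tables and cross-region RDS read replicas cover a lot of ground; a rough sketch with placeholder names:

    # add a replica region to an existing DynamoDB table (global tables)
    aws dynamodb update-table --table-name orders \
      --replica-updates '[{"Create":{"RegionName":"eu-west-1"}}]' \
      --region us-east-1

    # cross-region RDS read replica, created from the destination region using the source ARN
    aws rds create-db-instance-read-replica \
      --db-instance-identifier orders-replica \
      --source-db-instance-identifier arn:aws:rds:us-east-1:123456789012:db:orders-primary \
      --region eu-west-1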