Whenever one of our cloud services went down, he would go to great lengths to not update our status dashboard. When we finally forced him to update the status page, he would only change it to yellow and write vague updates about how service might be degraded for some customers. He flat out refused to ever admit that the cloud services were down.
After some digging, he told us that admitting your services were down was considered a death sentence for your job at his previous team at Amazon. He was so scarred from the experience that he refused to ever take responsibility for outages. Ultimately, we had to put someone else in charge of updating the status page because he just couldn't be trusted.
FWIW, I have other friends who work on different teams at Amazon who have not had such bad experiences.
Having a 'red' dashboard catches a lot of eyes, so the people responsible for making this decision always look at it from a political point of view.
As a dev on call, we used to get 20 sev2s per day (an oncall ticket that needs to be handled within 15 minutes), so most of the time things are broken; it's just not visible to external customers through the dashboard.
The status page is entirely manually updated.
There are perverse incentives to NOT update your status dashboard. Once I was asked by management to _take our status dashboard down_. That sounded backwards, so I dug a bit more.
Turns out our competitor was using our status dashboard as ammo against us in their sales pitch. Their claim was that we had too many issues and were unreliable.
That was ironic, because they didn't even have a status dashboard to begin with. Also, an outage on their system was much more catastrophic than an outage on our system. Ours was, for the most part, a control plane. If it went down, customers would lose management abilities for as long as the outage persisted. An outage at our competitor, meanwhile, would bring customer systems down.
We ended up removing the public dashboard and using other mechanisms to notify customers.
I assume there's some selection bias going on whenever we're able to hire people out of FAANG companies. We compensated similarly, but in theory had a lower promotion ceiling simply because we weren't FAANG. I assume he wanted out of Amazon because he wasn't on a great team there.
AWS, and Amazon in general, espouse all sorts of values relating to taking responsibility and owning problems.
What's left unstated is that the management structure hammers you to the wall as soon as they find somebody to blame.
I always wonder how many more products AWS pushes out the door versus cleaning up and improving what they have already. Cognito itself is such a half-baked mess...
But back to topic, when should we update status pages? On every incident? Or when SLAs are violated?
If a person or company’s compensation depends on not fessing up to problems, they won’t fess up to them.
Resolving a COE can be a positive if you know how to spin it; at least that was the case when I was there. But I'm not sure whether things have changed.
COE also doesn't lead to negative marks on anyone at AWS that I know of. It's a learning experience to know why it happened and action items so it doesn't happen again.
Writing a COE is a kind of admission of guilt, and I have definitely seen promotions get delayed. During perf review, managers of other teams often raise a COE as a point against the person going for promotion.
Even if you don't know what to say, still update the page saying exactly that, so the rest of us can report to our teams and make decisions about our own work lives and personal lives.
- https://github.com/dexidp/dex
- https://github.com/authelia/authelia
We're like Stripe for SSO/SAML auth. Docs here: https://workos.com/docs
Here's our HN launch: https://news.ycombinator.com/item?id=22607402
Regular Joes like us can use AWS, GCE, on premises, some non-reseller colocation provider, etc., and create failover duplicates, alternative deploy targets, or simply not ever have a complete outage due to the unlikelihood of all of these things failing at once.
Disclosure: I'm an employee of FusionAuth, and while there is a forever free community edition, it is free as in beer, not as in speech.
Ory looks like a really good project
Here's a subreddit with a bunch of posts you could sift through: https://www.reddit.com/r/KeyCloak/
Well, this is a major outage
The Lambda function associated with the CloudFront distribution is invalid or doesn't have the required permissions. We can't connect to the server for this app or website at this time. There might be too much traffic or a configuration error. Try again later, or contact the app or website owner. If you provide content to customers through CloudFront, you can find steps to troubleshoot and help prevent this error by reviewing the CloudFront documentation.
Heh, maybe they accidentally locked themselves out of IAM, too; those are great fun to troubleshoot.
Before someone replies and says to use a different region, that's not possible for everyone. If you use a third-party service that is hosted in us-east-1, you can't do anything about it. For example, many Heroku services are broken because of this.
All on the eve of Thanksgiving.
Having lots of services that do one thing and one thing well makes a lot of sense. Breaking them out into separate components brings a level of visibility into the system. And it's AWS's whole business model.
But it does mean that, fundamentally, service X is available when and only when (WAOW?) services A, B, C, etc. are all available. Its uptime is no greater than min(uptime(A), uptime(B), etc)
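A quick back-of-the-envelope sketch (with made-up availability numbers) shows why: the composite can never beat its least-available dependency, and if failures are independent it does strictly worse than that.

```python
# Composite availability of a service that depends on several others.
# Each dependency's availability is a fraction in [0, 1]; the
# numbers here are purely illustrative.
deps = {"A": 0.999, "B": 0.9995, "C": 0.998}

# Upper bound: the composite can never beat its worst dependency.
upper_bound = min(deps.values())

# If failures are independent, expected availability is the product,
# which is strictly below the min whenever more than one dependency
# is imperfect.
independent = 1.0
for availability in deps.values():
    independent *= availability

print(f"upper bound: {upper_bound:.4%}")
print(f"independent: {independent:.4%}")
```

Three "three nines" dependencies already drag the composite noticeably below three nines, which is the point being made above.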
I'm trying to rework the authentication for our application and integrate it with our parent company's systems. As we talk to other teams, I see all these architecture diagrams where the solution to every problem is Yet Another Service, to the point where you're running a real Rube Goldberg machine.
[0] https://twitter.com/apgwoz/status/1292519906433306625?s=20
In 2017 there was an S3 issue that supposedly affected their ability to post. I believe they said that they were updating how they posted to the status board so that there would no longer be a dependency on S3. Well, I guess whatever they're dependent on now broke.
No matter how much you value science and engineering, it ultimately doesn't matter to the business unless that aligns directly with their revenue stream. Sometimes it does, sometimes it doesn't.
When you're advertising uptime/availability, you're motivated not to report downtime/unavailability. Then the value of such reports is lost; developers start banging their heads trying to figure out if it's a service outage or a bug in their software (yes, informed by personal experience).
The main change they made in 2017 was the ability to post a message at the top of the page that is independent of the status of the individual items below. IIRC, it was the items they couldn't update. So that is kind of a hack, but it works.
It would be ideal if it were hosted entirely on completely separate infrastructure, and even a separate domain, but I won't hold my breath. Theirs is still more reliable than, for example, the IBM Cloud status page, which was hard down during their epic outage back in June.
Luckily my company decided against multi-AZ for the cost savings, so I spent all day firefighting.
AWS has to take a hard look at how they build their software. Their bad engineering practices will eventually catch up to them. You can't treat AWS the same as Alexa. Sometimes it's smarter to take your time to ship stuff instead of putting it out there. Burning out your oncall engineers is not a feasible long-term plan.
AWS will be in deep trouble when/if GCE fixes their customer support.
You seem to have insight on AWS's engineering practices. From your point of view what should be changed?
It really does seem that any time there is an outage, more often than not the status page is showing all green traffic lights, rendering it useless as a tool to corroborate what's happening.
How did AWS status page compare with status.io/aws?
Failure happens at the speed of computing but agreeing that something is failing in a way that customers need to be told about is a slower process.
Even when status pages are fully automatic (rather than manually updated), there will tend to be gaming of the metrics that constitute that.
Ideally you would just be monitoring your SLOs and publishing that to customers... that doesn't seem to be how it works, anywhere.
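For illustration, a minimal sketch of what "monitor your SLOs and publish that" could look like. The objective, request counts, and window are all made-up numbers, not anyone's real SLO:

```python
# Toy SLO check: a success-rate objective over some rolling window.
# Hypothetical figures; a real system would pull these from metrics.
SLO_TARGET = 0.999          # objective: 99.9% of requests succeed

total_requests = 1_000_000
failed_requests = 1_800

success_rate = 1 - failed_requests / total_requests
error_budget = (1 - SLO_TARGET) * total_requests   # failures we may "spend"
budget_remaining = error_budget - failed_requests

print(f"success rate:     {success_rate:.4%}")
print(f"budget remaining: {budget_remaining:.0f} requests")
print("SLO met" if success_rate >= SLO_TARGET else "SLO violated")
```

Publishing numbers like these mechanically, rather than having a human decide when to flip a light to yellow, is exactly what removes the gaming, and exactly why it rarely happens.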
Publicly disclosing an incident to a customer is embarrassing and potentially damaging but almost equally as damaging is telling other teams you had an incident. Now anything that goes wrong is your fault by default because “it’s probably related to that incident” and any new security policies are blamed on the other team: “we wouldn’t have to do that if Ops didn’t mess up last month”.
The answer to “is this service suffering an outage” is seriously complex and hard to determine. The answer to “is this a security incident” is 10x harder and 100x more political because the industry is still just so wildly immature.
Admitting that your services are down could be costly to your career progression and bonus. When people know this, they go to great lengths to avoid admitting fault. Updating the status page is the first admission of fault. The longer the status page shows an outage, the worse it gets.
I worked with an ex-Amazon engineer at a previous company. After each outage, he would spend days or weeks writing long reports explaining how the outage was not his fault. He didn't care about downtime so much as about not getting blamed for outages. Predictably, this was terrible for team morale, and most of his team members ended up quitting.
If anyone else finds themselves in this position, the solution is to have another team responsible for monitoring uptime, and to rate teams on how quickly they acknowledge outages. Once the response time and accuracy of your status page become a performance metric, people are less likely to play games with it.
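As a toy illustration of that metric (all delays below are hypothetical), the score can be as simple as a mean time-to-acknowledge per team:

```python
# Sketch: scoring teams on how fast they acknowledge outages.
# Values are minutes from outage start to the status-page update,
# one entry per incident; the numbers are invented for illustration.
from statistics import mean

ack_delays = {
    "team-a": [4, 7, 5],
    "team-b": [35, 90, 12],
}

for team, delays in sorted(ack_delays.items()):
    print(f"{team}: mean time to acknowledge = {mean(delays):.1f} min")
```

Even a crude number like this makes the slow-to-admit pattern visible at review time, which is the whole point.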
What is an outage? When does an outage reach sufficient scale that updating the status page is the right thing to do?
I used to work for AWS, and now work for another cloud provider.
One thing that's hard to communicate is the sheer scale that these services operate at, what that means architecturally, and how they tend to break.
Outages, or even slight degradations, occurring at whole-service scale are very rare. I would argue from my experience there that most incidents affect less than 10% of any given service's customers. Whether an incident gets noticed partly depends on who falls within that percentage.
What is very often the case is that a subset of customers gets impacted to some degree during any given incident. That can be a single percent of customers or less, yet still be an incident with all hands on deck and the entire management chain of the service aware and involved.
At what percentage do you draw the line and say "Yes, we need this percentage of our customers to be affected before we post a green-i" (AWS terminology for the first stage of failure notification)?
How do you communicate that effectively to customers, in a way that doesn't suggest your service is unreliable when it really isn't?
The moment you post a green-i or above, customers start blaming you and your service for problems with their infrastructure that are not caused by it. If you're looking to use a service and go look at the status history and see it filled with green-i or similar, are you likely to trust it? No. Even if those green-i's were for impacts on a limited subset of customers.
AWS wrestled with this a bunch about 5-6 years ago. There were no end of discussions during the weekly ops meetings with senior leadership, directors and engineers across the company. Everyone wants to do the right thing and make sure customers get an accurate picture about the health of the service, without giving the wrong impression.
In the end they opted to move towards having personal notifications for outages, and build tooling to help services quickly identify which customers are being affected by any particular incident and provide personalised status pages for them that can be way more accurate than any generalised status page.
You'd think they would have learned from that.
If you look at where the content on https://status.aws.amazon.com/ is actually hosted from you'll see things like the status icons are all hosted under the same domain, e.g. https://status.aws.amazon.com/images/status1.gif https://status.aws.amazon.com/images/status0.gif etc.
If you look at the source code for the site, you'll again see that everything is hosted from the same domain.
One of their main goals was to ensure that it could never go wrong that way again.
> You'd think they would have learned from that.
They did.
The page has been updated numerous times since the start of this incident.
Which makes me wonder, why do we all rely on status pages rather than solve the problem ourselves in ways that don't require us to rely on the vendor?
We're using it to federate customer IDPs through user pools, but this ends up with customer configs being region specific.
Has anyone figured out how to set up Cognito in multiple regions without the hijinks of having the customer set up trusts for each region? Not to mention, while multiple trusts are, I think, possible with ADFS (not that I've tested it), I'm pretty sure Okta doesn't support multiple trusts, so regardless of how many regions, we'd still be SOL there...
Of course you'll have to deal with home realm discovery; you really need to go in with open eyes on that one.
Is that not a massive catch-22 for a service dashboard?
Cloudflare does it right for their status page (https://www.cloudflarestatus.com). They don't use Cloudflare itself for it (you can tell because /cdn-cgi/trace returns nothing), the actual backend is Atlassian Statuspage, their TLS certificate is issued by Let's Encrypt instead of Cloudflare itself, and it's on a completely separate domain for DNS purposes.
$ whois cloudflarestatus.com
Registrar: Cloudflare, Inc.

Last sentence of the alert at the top of the page.
I think the other explanations sound plausible. There is no technical difficulty here that AWS can't solve; it's political. Acknowledging an outage on the status page makes you liable under your SLAs.
https://downdetector.co.uk/status/visa/map/
I am unable to order my Papa Johns pizza
This is why I prefer 3rd party monitoring systems to track health of my internal monitoring systems.
- 7 cloudfront distributions created today are still in "InProgress", a few already for more than one hour
- The support case I created about it doesn't show up in my support portal. Direct link to it does work though
Was posted 8 minutes ago.
Seems like they fixed Cognito while Kinesis and many other services are still broken - presumably somehow removing the dependency on Kinesis? It’ll be really interesting if their post mortem explains this mitigation.
Then the status page would be almost entirely useless ...
Happened faster than I thought, but based on reading the comments about people who work(ed) there, this seems cut and dried to me.
This is when you fall back to the Tumblr blog for status updates.
<rimshot>
I guess the lawyers of those who paid for uptime guarantees...
Never trust that. Deploy in multiple regions (and AZs within those regions) if you really cannot tolerate any downtime.
"amazon-cognito-identity-js": "^3.2.2",
"aws-amplify": "^2.2.2"
It is reported now in their service health dashboard.
Oh, wait! EY, PWC, and who can forget Arthur Andersen!
But, naturally, technology people can solve this better than anyone else, right?
EDIT: nevermind, the Post is back, and Kinesis is still erroring.
Is there a status website for AWS Status?
Annoyingly, they expect you to do the leg work to show when the outage happened and supply logs demonstrating that you were impacted.
Might want to do some napkin math first to see if the amount of credit is worth your time. The couple of times my org considered pursuing it, it just wasn't worth the effort. (Though, personally, I think that speaks to a larger problem with the SLA.)
Credit Request Procedure in Kinesis SLA: https://aws.amazon.com/kinesis/sla/#Credit_Request_and_Payme...
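A hedged version of that napkin math, with all figures invented for illustration (your bill, the applicable credit tier, and the effort involved will differ; the real tiers are in the SLA itself):

```python
# Napkin math: is an SLA credit claim worth the effort?
# All numbers are hypothetical placeholders.
monthly_spend = 500.00        # monthly bill for the affected service, USD
credit_percentage = 0.10      # e.g. a 10% credit tier
hours_to_file = 2             # gathering logs, filing the support case
engineer_rate = 75.00         # loaded hourly cost of that time, USD

credit = monthly_spend * credit_percentage
cost_to_claim = hours_to_file * engineer_rate

print(f"credit:        ${credit:.2f}")
print(f"cost to claim: ${cost_to_claim:.2f}")
print("worth it" if credit > cost_to_claim else "not worth it")
```

With these placeholder numbers the claim costs more than it returns, which matches the experience described above.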
> upstream connect error or disconnect/reset before headers. reset reason: overflow
And request timeouts against cognito-idp.us-east-1.amazonaws.com
And the cognito console won't load
Their ETA: 2 hours, and then try contacting them again!