Whenever one of our cloud services went down, he would go to great lengths to not update our status dashboard. When we finally forced him to update the status page, he would only change it to yellow and write vague updates about how service might be degraded for some customers. He flat out refused to ever admit that the cloud services were down.
After some digging, he told us that admitting your services were down was considered a death sentence for your job at his previous team at Amazon. He was so scarred from the experience that he refused to ever take responsibility for outages. Ultimately, we had to put someone else in charge of updating the status page because he just couldn't be trusted.
FWIW, I have other friends who work on different teams at Amazon who have not had such bad experiences.
Having a 'red' dashboard catches a lot of eyes, so the people responsible for making this decision always look at it from a political point of view.
As a dev on call, we used to get 20 sev2s per day (an oncall ticket that needs to be handled within 15 minutes), so most of the time things are broken; it's just not visible to external customers through the dashboard.
The status page is entirely manually updated.
There are perverse incentives to NOT update your status dashboard. Once I was asked by management to _take our status dashboard down_. That sounded backwards, so I dug a bit more.
Turns out our competitor was using our status dashboard as ammo against us in their sales pitch. Their claim was that we had too many issues and were unreliable.
That was ironic, because they didn't even have a status dashboard to begin with. Also, an outage on their system was much more catastrophic than an outage on our system. Ours was, for the most part, a control plane. If it went down, customers would lose management abilities for as long as the outage persisted. An outage at our competitor, meanwhile, would bring customer systems down.
We ended up removing the public dashboard and using other mechanisms to notify customers.
I assume there's some selection bias going on whenever we're able to hire people out of FAANG companies. We compensated similarly, but in theory had a lower promotion ceiling simply because we weren't FAANG. I assume he wanted out of Amazon because he wasn't on a great team there.
AWS, and Amazon in general, espouse all sorts of values relating to taking responsibility and owning problems.
What's left unstated is that the management structure hammers you to the wall as soon as they find somebody to blame.
I always wonder how many more products AWS pushes out the door versus cleaning up and improving what they have already. Cognito itself is such a half-baked mess...
But back to topic, when should we update status pages? On every incident? Or when SLAs are violated?
If a person or company’s compensation depends on not fessing up to problems, they won’t fess up to them.
Resolving a COE can be a positive if you know how to spin it; at least that was the case when I was there. But I'm not sure whether things have changed.
COE also doesn't lead to negative marks on anyone at AWS that I know of. It's a learning experience to know why it happened and action items so it doesn't happen again.
Writing a COE is a kind of admission of guilt, and I have definitely seen promotions get delayed. During perf review, managers of other teams often raise a COE as a point against the person going for promotion.
Even if you don't know what to say, still update the page saying exactly that, so the rest of us can report to our teams and make decisions about our own work lives and personal lives.
- https://github.com/dexidp/dex
- https://github.com/authelia/authelia
We're like Stripe for SSO/SAML auth. Docs here: https://workos.com/docs
Here's our HN launch: https://news.ycombinator.com/item?id=22607402
Regular Joes like us can use AWS, GCE, on premises, some non-reseller colocation provider, etc., and create failover duplicates, alternative deploy targets, or simply not ever have a complete outage due to the unlikelihood of all of these things failing at once.
Disclosure: I'm an employee of FusionAuth, and while there is a forever free community edition, it is free as in beer, not as in speech.
Ory looks like a really good project
Here's a subreddit with a bunch of posts you could sift through: https://www.reddit.com/r/KeyCloak/
Well, this is a major outage
The Lambda function associated with the CloudFront distribution is invalid or doesn't have the required permissions. We can't connect to the server for this app or website at this time. There might be too much traffic or a configuration error. Try again later, or contact the app or website owner. If you provide content to customers through CloudFront, you can find steps to troubleshoot and help prevent this error by reviewing the CloudFront documentation.
Heh, maybe they accidentally locked themselves out of IAM, too; those are great fun to troubleshoot.
Before someone replies and says to use a different region, that's not possible for everyone. If you use a third-party service that is hosted in us-east-1, you can't do anything about it. For example, many Heroku services are broken because of this.
All on the eve of Thanksgiving.
Having lots of services that do one thing and one thing well makes a lot of sense. Breaking them out into separate components brings a level of visibility into the system. And it's AWS's whole business model.
But it does mean that, fundamentally, service X is available when and only when (WAOW?) services A, B, C, etc. are all available. Its uptime is no greater than min(uptime(A), uptime(B), etc)
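A quick back-of-the-envelope sketch (with made-up availability numbers) shows why: the composite can never beat its least-available dependency, and if failures are independent it does strictly worse than that.

```python
# Composite availability of a service that depends on several others.
# Each dependency's availability is a fraction in [0, 1]; the
# numbers here are purely illustrative.
deps = {"A": 0.999, "B": 0.9995, "C": 0.998}

# Upper bound: the composite can never beat its worst dependency.
upper_bound = min(deps.values())

# If failures are independent, expected availability is the product,
# which is strictly below the min whenever more than one dependency
# is imperfect.
independent = 1.0
for availability in deps.values():
    independent *= availability

print(f"upper bound: {upper_bound:.4%}")
print(f"independent: {independent:.4%}")
```

Three "three nines" dependencies already drag the composite noticeably below three nines, which is the point being made above.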
I'm trying to rework the authentication for our application and integrate it with our parent company's systems. As we talk to other teams, I see all these architecture diagrams where the solution to every problem is Yet Another Service, to the point where you're running a real Rube Goldberg machine.
[0] https://twitter.com/apgwoz/status/1292519906433306625?s=20
In 2017 there was an S3 issue that supposedly affected their ability to post. I believe they said that they were updating how they posted to the status board so that there would no longer be a dependency on S3. Well, I guess whatever they're dependent on now broke.
No matter how much you value science and engineering, it ultimately doesn't matter to the business unless that aligns directly with their revenue stream. Sometimes it does, sometimes it doesn't.
When you're advertising uptime/availability, you're motivated not to report downtime/unavailability. Then the value of such reports is lost; developers start banging their heads trying to figure out if it's a service outage or a bug in their software (yes, informed by personal experience).
The main change they made in 2017 was the ability to post a message at the top of the page that is independent of the status of the individual items below. IIRC, it was the items they couldn't update. So that is kind of a hack, but it works.
It would be ideal if it were hosted entirely on completely separate infrastructure, and even a separate domain, but I won't hold my breath. Theirs is still more reliable than, for example, the IBM Cloud status page, which was hard down during their epic outage back in June.
Luckily my company decided against multi-AZ for the cost savings, so I spent all day firefighting.
AWS has to take a hard look at how they build their software. Their bad engineering practices will eventually catch up to them. You can't treat AWS the same as Alexa. Sometimes it's smarter to take your time to ship stuff instead of putting it out there. Burning out your oncall engineers is not a feasible long-term plan.
AWS will be in deep trouble when/if GCE fixes their customer support.
You seem to have insight on AWS's engineering practices. From your point of view what should be changed?
It really does seem that any time there is an outage, more often than not the status page is showing all green traffic lights, rendering it useless as a tool to corroborate what's happening.
How did AWS status page compare with status.io/aws?
Failure happens at the speed of computing but agreeing that something is failing in a way that customers need to be told about is a slower process.
Even when status pages are fully automatic (rather than manually updated), there will tend to be gaming of the metrics that constitute that.
Ideally you would just be monitoring your SLOs and publishing that to customers... that doesn't seem to be how it works, anywhere.
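For illustration, a minimal sketch of what "monitor your SLOs and publish that" could look like. The objective, request counts, and window are all made-up numbers, not anyone's real SLO:

```python
# Toy SLO check: a success-rate objective over some rolling window.
# Hypothetical figures; a real system would pull these from metrics.
SLO_TARGET = 0.999          # objective: 99.9% of requests succeed

total_requests = 1_000_000
failed_requests = 1_800

success_rate = 1 - failed_requests / total_requests
error_budget = (1 - SLO_TARGET) * total_requests   # failures we may "spend"
budget_remaining = error_budget - failed_requests

print(f"success rate:     {success_rate:.4%}")
print(f"budget remaining: {budget_remaining:.0f} requests")
print("SLO met" if success_rate >= SLO_TARGET else "SLO violated")
```

Publishing numbers like these mechanically, rather than having a human decide when to flip a light to yellow, is exactly what removes the gaming, and exactly why it rarely happens.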
Publicly disclosing an incident to a customer is embarrassing and potentially damaging but almost equally as damaging is telling other teams you had an incident. Now anything that goes wrong is your fault by default because “it’s probably related to that incident” and any new security policies are blamed on the other team: “we wouldn’t have to do that if Ops didn’t mess up last month”.
The answer to “is this service suffering an outage” is seriously complex and hard to determine. The answer to “is this a security incident” is 10x harder and 100x more political because the industry is still just so wildly immature.
Admitting that your services are down could be costly to your career progression and bonus. When people know this, they go to great lengths to avoid admitting fault. Updating the status page is the first admission of fault. The longer the status page shows an outage, the worse it gets.
I worked with an ex-Amazon engineer at a previous company. After each outage, he would spend days or weeks writing long reports explaining how the outage was not his fault. He didn't care about downtime so much as about not getting blamed for outages. Predictably, this was terrible for team morale, and most of his team members ended up quitting.
If anyone else finds themselves in this position, the solution is to have another team responsible for monitoring uptime, and to rate teams on how quickly they acknowledge outages. Once the response time and accuracy of your status page become a performance metric, people are less likely to play games with it.
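As a toy illustration of that metric (all delays below are hypothetical), the score can be as simple as a mean time-to-acknowledge per team:

```python
# Sketch: scoring teams on how fast they acknowledge outages.
# Values are minutes from outage start to the status-page update,
# one entry per incident; the numbers are invented for illustration.
from statistics import mean

ack_delays = {
    "team-a": [4, 7, 5],
    "team-b": [35, 90, 12],
}

for team, delays in sorted(ack_delays.items()):
    print(f"{team}: mean time to acknowledge = {mean(delays):.1f} min")
```

Even a crude number like this makes the slow-to-admit pattern visible at review time, which is the whole point.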
What is an outage? When does an outage reach sufficient scale that updating the status page is the right thing to do?
I used to work for AWS, and now work for another cloud provider.
One thing that's hard to communicate is the sheer scale that these services operate at, what that means architecturally, and how they tend to break.
Outages, or even slight degradations, occurring at whole-service scale are very rare. I would argue from my experience there that most incidents affect less than 10% of any given service's customers. Whether an incident gets noticed partly depends on who falls within that percentage.
What is very often the case is that a subset of customers gets impacted to some degree during any given incident. That can be a single percent of customers or less, yet still be an incident with all hands on deck and the entire management chain of the service aware and involved.
At what percentage do you draw the line and say "Yes, we need this percentage of our customers to be affected before we post a green-i" (AWS terminology for the first stage of failure notification)?
How do you communicate that effectively to customers, in a way that doesn't suggest your service is unreliable when it really isn't?
The moment you post a green-i or above, customers start blaming you and your service for problems with their infrastructure that are not caused by it. If you're looking to use a service and go look at the status history and see it filled with green-i or similar, are you likely to trust it? No. Even if those green-i's were for impacts on a limited subset of customers.
AWS wrestled with this a bunch about 5-6 years ago. There were no end of discussions during the weekly ops meetings with senior leadership, directors and engineers across the company. Everyone wants to do the right thing and make sure customers get an accurate picture about the health of the service, without giving the wrong impression.
In the end they opted to move towards having personal notifications for outages, and build tooling to help services quickly identify which customers are being affected by any particular incident and provide personalised status pages for them that can be way more accurate than any generalised status page.
You'd think they would have learned from that.
If you look at where the content on https://status.aws.amazon.com/ is actually hosted from you'll see things like the status icons are all hosted under the same domain, e.g. https://status.aws.amazon.com/images/status1.gif https://status.aws.amazon.com/images/status0.gif etc.
If you look at the source code for the site, you'll again see that everything is hosted from the same domain.
One of their main goals was to ensure that it could never go wrong that way again.
> You'd think they would have learned from that.
They did.
The page has been updated numerous times since the start of this incident.
Which makes me wonder, why do we all rely on status pages rather than solve the problem ourselves in ways that don't require us to rely on the vendor?
We're using it to federate customer IDPs through user pools, but this ends up with customer configs being region specific.
Has anyone figured out how to set up Cognito in multiple regions without the hijinks of having the customer set up trusts for each region? Not to mention, while multiple trusts are, I think, possible with ADFS (not that I've tested it), I'm pretty sure Okta doesn't support multiple trusts, so regardless of how many regions, we'd still be SOL there...
Of course you'll have to deal with home realm discovery; you really need to go in with open eyes on that one.
Is that not a massive catch-22 for a service dashboard?
Cloudflare does it right for their status page (https://www.cloudflarestatus.com). They don't use Cloudflare itself for it (you can tell because /cdn-cgi/trace returns nothing), the actual backend is Atlassian Statuspage, their TLS certificate is issued by Let's Encrypt instead of Cloudflare itself, and it's on a completely separate domain for DNS purposes.
$ whois cloudflarestatus.com
Registrar: Cloudflare, Inc.

Last sentence of the alert at the top of the page.
I think the other explanations sound plausible. There is no technical difficulty here that AWS can't solve; it's political. Acknowledging an outage on the status page makes you liable under your SLAs.
https://downdetector.co.uk/status/visa/map/
I am unable to order my Papa Johns pizza
This is why I prefer 3rd party monitoring systems to track health of my internal monitoring systems.
- 7 cloudfront distributions created today are still in "InProgress", a few already for more than one hour
- The support case I created about it doesn't show up in my support portal. Direct link to it does work though
Was posted 8 minutes ago.
Seems like they fixed Cognito while Kinesis and many other services are still broken - presumably somehow removing the dependency on Kinesis? It’ll be really interesting if their post mortem explains this mitigation.
Then the status page would be almost entirely useless ...
Happened faster than I thought, but based on reading the comments about people who work(ed) there, this seems cut and dried to me.
This is when you fall back to the Tumblr blog for status updates.
<rimshot>
I guess the lawyers of those who paid for uptime guarantees...
Never trust that. Deploy in multiple regions (and AZs within those regions) if you really cannot tolerate any downtime.
"amazon-cognito-identity-js": "^3.2.2",
"aws-amplify": "^2.2.2"
It is reported now in their service health dashboard.
Oh, wait! EY, PWC, and who can forget Arthur Andersen!
But, naturally, technology people can solve this better than anyone else, right?
EDIT: nevermind, the Post is back, and Kinesis is still erroring.
Is there a status website for AWS Status?
Annoyingly, they expect you to do the leg work to show when the outage happened and supply logs demonstrating that you were impacted.
Might want to do some napkin math first to see if the amount of credit is worth your time. The couple of times my org considered pursuing it, it just wasn't worth the effort. (Though, personally, I think that speaks to a larger problem with the SLA.)
Credit Request Procedure in Kinesis SLA: https://aws.amazon.com/kinesis/sla/#Credit_Request_and_Payme...
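A hedged version of that napkin math, with all figures invented for illustration (your bill, the applicable credit tier, and the effort involved will differ; the real tiers are in the SLA itself):

```python
# Napkin math: is an SLA credit claim worth the effort?
# All numbers are hypothetical placeholders.
monthly_spend = 500.00        # monthly bill for the affected service, USD
credit_percentage = 0.10      # e.g. a 10% credit tier
hours_to_file = 2             # gathering logs, filing the support case
engineer_rate = 75.00         # loaded hourly cost of that time, USD

credit = monthly_spend * credit_percentage
cost_to_claim = hours_to_file * engineer_rate

print(f"credit:        ${credit:.2f}")
print(f"cost to claim: ${cost_to_claim:.2f}")
print("worth it" if credit > cost_to_claim else "not worth it")
```

With these placeholder numbers the claim costs more than it returns, which matches the experience described above.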
> upstream connect error or disconnect/reset before headers. reset reason: overflow
And request timeouts against cognito-idp.us-east-1.amazonaws.com
And the cognito console won't load
Their ETA: 2 hours, and then try contacting them again!