
Twirrim 5y ago

Congratulations, you're already complicating the status page.

The status page shouldn't be figuring out what the status of any service is. It's impossible to do without a lot of contextual information about a service and understanding how to evaluate service impact, something that is continually in flux.

It just needs to be a page that is updated manually. AWS has a 24x7 incident management team that could / should do it.


Kinrany 5y ago

Updated manually by whom?

I'm afraid you're shifting the complexity to a manual process.

I agree that it doesn't have to, and perhaps should not, be fully automated. But automating some parts will help not waste time on last minute arguments.

Twirrim (OP) 5y ago

> Updated manually by whom?
>
> I'm afraid you're shifting the complexity to a manual process.

You're right, that's 100% what I'm doing. Why? Because it shouldn't be that complicated to update an overall health status page during an outage event, and it shouldn't take other tools and services within AWS to do it.

A common pattern in cloud providers (including AWS) is that services have some kind of tiering, whereby you can't pick up a dependency on any service on a lower tier than yourselves. Tier 2 services can't rely on Tier 3 services, etc. Services like, say, IAM, would be right at the very top. It can't rely on EBS, ELB etc. Everything has to be created in-service, because everything ultimately has to rely on authentication working.
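
To make the tiering rule concrete, here is a minimal sketch of the check it implies. The tier numbers, the service-to-tier assignments, and the `tier_violations` helper are all hypothetical and chosen for illustration only; this is not how AWS actually records or enforces tiers.

```python
# Hypothetical tiers: 1 is the top tier, larger numbers are lower tiers.
# These assignments are illustrative assumptions, not AWS's real tiering.
TIERS = {"iam": 1, "status-page": 1, "ebs": 2, "elb": 2, "kinesis": 3}


def tier_violations(dependencies, tiers=TIERS):
    """Return (service, dependency) pairs where a service relies on a lower tier."""
    violations = []
    for service, deps in dependencies.items():
        for dep in deps:
            # A larger tier number means a lower tier, so this is a forbidden dependency.
            if tiers[dep] > tiers[service]:
                violations.append((service, dep))
    return violations


# Hypothetical dependency graph: a top-tier status page that leans on Kinesis.
deps = {
    "kinesis": ["iam", "ebs"],
    "ebs": ["iam"],
    "status-page": ["kinesis"],
}
print(tier_violations(deps))  # [('status-page', 'kinesis')]
```

Under these assumed tiers, the only flagged edge is the status page depending on a lower-tier service, which is the kind of dependency the rule is meant to rule out.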

If they're going to keep an overall status page going, it needs to be seen as a top tier service, just like identity is. That's where they were headed when I left AWS about 5 1/2 years ago. It had been spurred by a previous major incident that couldn't be reflected in the status dashboard because of a failure in a dependency.

> I agree that it doesn't have to, and perhaps should not, be fully automated. But automating some parts will help not waste time on last minute arguments.

I go into a bit more detail in another comment within this discussion, but a status page does not come even close to accurately capturing the ways that cloud environments fail. Failures very, very rarely affect more than a small percentage of customers, and even then often in some very specific way under specific circumstances. That's why AWS built the personalised status page service. They want to ensure that customers have an accurate way of telling what is going on with services they're consuming, rather than the confusing situation of checking an overall status site that doesn't really reflect their experience and never could.

Situations like today's, where (at least from the outside) it seemed like Kinesis was completely down, would be a good example of something that should be reflected in the main overall status page.

The status page should be manual, and should be something the incident management team can update (with the political ability to force it to happen, rather than being subject to the whims of service directors).
