Now combine all of the above with a client that has retry capabilities, whether a modern web app or a desktop app. Eventually consistent systems often rely on retry behavior and rate limiting to keep transitions smooth for users. At that point I can't simply rely on 500s being sent, because a 500 may indicate a timeout or a caching problem that a retry will absorb; instead I need statistics on the specific endpoints that will definitely surface a user-facing error. Collecting that in real time (real-time enough for alerting, anyway) is challenging, as a company at that scale could be dealing with an enormous number of requests per second.
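As a rough sketch of what "statistics on specific endpoints" means here (all names and thresholds are hypothetical; a real system would shard this across a streaming pipeline, not a single process), the core bookkeeping is a rolling window that only counts responses the client actually surfaced to the user:

```python
import time
from collections import defaultdict, deque


class EndpointErrorWindow:
    """Rolling per-endpoint count of user-facing errors over the last N seconds.

    Hypothetical sketch: counts only non-2xx responses that survived the
    client's final (third) retry, i.e. the ones a user actually saw.
    """

    def __init__(self, window_seconds=60):
        self.window = window_seconds
        self.events = defaultdict(deque)  # endpoint -> timestamps of user-facing errors

    def record(self, endpoint, status, retries, now=None):
        # Transient 500s that a retry absorbed never reach the user,
        # so they don't count toward customer impact.
        if 200 <= status < 300 or retries < 3:
            return
        self.events[endpoint].append(time.time() if now is None else now)

    def rate(self, endpoint, now=None):
        """User-facing errors per second over the window, for alerting."""
        now = time.time() if now is None else now
        q = self.events[endpoint]
        while q and q[0] < now - self.window:
            q.popleft()  # expire events older than the window
        return len(q) / self.window
```

A transient 500 on the first attempt is recorded nowhere; only the third failed retry moves the needle, which is the distinction the alerting needs.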
When SREs get into an incident, they'll often try to determine customer impact in order to know which hemorrhaging to stop first. Looking at a list of 500s in a system like that is often unhelpful, so we build dashboards of specific endpoints that show a level of degradation, e.g., "show me all requests that did not return a 2xx where the number of retries is 3." In my contrived example, the client shows an error only after the third exponential retry. If you calculate availability purely off the number of 500s, you're not actually calculating customer impact; you're calculating the number of errors. That said, building a data system that can answer a query like that is far easier said than done, much less one that can export the results. So, in order to provide accurate information, the status site is updated manually.
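The query itself is simple to sketch over a batch of request records (field names here are hypothetical; the genuinely hard part is joining the client's retry count onto server-side logs at volume, not the filter):

```python
def user_facing_errors(requests):
    """Keep only records where a user actually saw an error:
    a non-2xx status on the client's third (final) exponential retry."""
    return [
        r for r in requests
        if not (200 <= r["status"] < 300) and r["retries"] == 3
    ]


requests = [
    {"endpoint": "/checkout", "status": 500, "retries": 1},  # retried, recovered
    {"endpoint": "/checkout", "status": 503, "retries": 3},  # user saw an error
    {"endpoint": "/login",    "status": 200, "retries": 3},  # final retry succeeded
]
```

Counting raw 500s here would report two failures; the customer impact is one, and on one endpoint.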
On the flip side of what you described, some errors don't have a statistic at all. For instance, if I force-rotate everyone's password and kill logins, I might post that on the status site as well. If it's the result of a security action or vulnerability, I might declare the service degraded for a period of time.