Good luck to the engineers at GitHub, as I know how stressful it can be, but I hope everyone else is enjoying a nice break and some socializing.
edit: Never mind, they just reported "degraded performance" for GitHub Actions, Issues, and Pull Requests.
By the way, I don't know of a single place where this isn't the case: a human signs off on and updates the status page during large events (at least for the final decision). Some of it will be automated, sure, like red flags being raised to operators. But at a certain point it isn't possible to fully automate this and get second-level accuracy or whatever; the system is rarely (if ever) in a binary state of "working perfectly" or "not working", but somewhere in between. You can't just fire off a big red error bar every time a blip occurs at a place like GitHub. The system is constantly "in motion". The logical conclusion is to just expose your 50+ Grafana dashboards publicly to every user. Isn't that the most honest "overview" of what is happening with your product? Except this often can't tell them anything useful either.
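To make the "blip vs. real incident" point concrete, here is a rough sketch of the kind of smoothing an automated red-flag system might apply before a human ever looks at it. Everything in it is an assumption for illustration: the `StatusSmoother` name, the thresholds, and the window size are made up, and I'm not claiming any provider does it this way.

```python
from collections import deque


class StatusSmoother:
    """Hypothetical sketch: only change status after a sustained signal."""

    def __init__(self, degraded_at=0.02, outage_at=0.25, window=5):
        # Thresholds are error-rate fractions; all values here are assumed.
        self.degraded_at = degraded_at
        self.outage_at = outage_at
        self.samples = deque(maxlen=window)
        self.status = "operational"

    def observe(self, error_rate):
        """Record one error-rate sample and return the (possibly held) status."""
        self.samples.append(error_rate)
        if len(self.samples) < self.samples.maxlen:
            return self.status  # not enough history yet, hold the status

        lowest, highest = min(self.samples), max(self.samples)
        if lowest >= self.outage_at:
            # Every sample in the window is bad enough to call an outage.
            self.status = "major outage"
        elif lowest >= self.degraded_at:
            # Every sample is elevated, but not all at outage level.
            self.status = "degraded"
        elif highest < self.degraded_at:
            # Every sample is healthy, so it is safe to recover.
            self.status = "operational"
        # Mixed window (a blip, or a partial recovery): keep the old status.
        return self.status
```

With a window of 3, a single sample at a 50% error rate surrounded by healthy samples leaves the status at "operational", while three consecutive samples at 5% flip it to "degraded". Even then, this only tells you that something is elevated, not what is affected or how to work around it, which is the part a human still writes.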
People on here will also mumble about SLAs but if a customer wants a kickback or is seriously worried about events like these, they're generally talking to account managers, not posting on internet forums. That said, a lot of them get weaselly about that stuff unless you're already negotiating prices with an AM in the first place...
We actually stopped doing this a year or two later because reporters were setting up monitoring to detect that page showing up and were reporting outage statistics based on it. So we just left the site up in whatever degraded state it was in, and that made the problem of measuring www.amazon.com uptime externally that little bit more difficult.
Or you get weasel-worded out of it. I had a cloud provider deny a service credit; the SLA stated that the service was only out of SLA if it didn't return 2xx. Well, the API returned "2xx Accepted — your request is being processed", and you could use the API to query the job, and the job … never finished or made any progress at all. But the API returned 2xx the entire time, so that was "within SLA".
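A small sketch of why "returned 2xx" and "actually worked" are different measurements. The `JobPoll` record, the function names, and the 10-minute stall limit are all hypothetical; no real provider's SLA terms or API are being quoted here.

```python
from dataclasses import dataclass


@dataclass
class JobPoll:
    """One poll of an async job endpoint (hypothetical shape)."""
    http_status: int
    progress: float   # 0.0 to 1.0, as reported by the job
    elapsed_s: float  # seconds since the job was submitted


def within_sla_status_only(polls):
    """The SLA as written in the story: any 2xx counts as 'up'."""
    return all(200 <= p.http_status < 300 for p in polls)


def within_sla_progress_aware(polls, stall_limit_s=600.0):
    """Assumed stricter check: 2xx AND the job must keep moving."""
    if not within_sla_status_only(polls):
        return False
    last_progress = None
    last_change_s = 0.0
    for p in polls:
        if last_progress is None or p.progress > last_progress:
            # Progress advanced (or first sample): reset the stall clock.
            last_progress = p.progress
            last_change_s = p.elapsed_s
        elif p.progress < 1.0 and p.elapsed_s - last_change_s > stall_limit_s:
            # Unfinished job with no progress for too long.
            return False
    return True


# A job that is accepted and then never moves.
stuck = [JobPoll(202, 0.0, 0.0), JobPoll(202, 0.0, 300.0), JobPoll(202, 0.0, 900.0)]
```

For the `stuck` job, `within_sla_status_only(stuck)` is true while `within_sla_progress_aware(stuck)` is false, which is exactly the gap the provider was standing in.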
I don't think it's politics (maybe AWS's is, but GCP's wasn't IMO), it's really a function of "in large-scale software systems things are constantly failing in all sorts of ways, and it's really hard to output a meaningful automated signal that things are broken." Sure, you can set up Pingdom-type health checks on every endpoint, but even then you're not necessarily guaranteeing that things are working properly.
Source: worked at a few cloud providers, paid out a few SLA violations
> We are investigating reports of degraded performance for GitHub Actions, Issues, and Pull Requests.
Their freaky homepage is borked: https://github.com/ yields a 500 error haha.
EDIT: it's updated now.
EDIT EDIT: github.com is back up and running, apologies for the disruption :(
Source: GitHub employee
When I worked at GCP it was all manually updated as well so we could add a sentence about what was actually affected. In any sufficiently large system it's hard to indicate exactly what's broken/how to work around it, so it was just easier to `/status <system> <color> <reason for status>`.
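For anyone curious what handling a command of that shape could look like, here is a minimal sketch of a parser. Only the `/status <system> <color> <reason for status>` format comes from the comment above; the function name, the `StatusUpdate` type, and the allowed colors are my assumptions, not how the actual tooling worked.

```python
from typing import NamedTuple

# Assumed color set; the real tool's accepted values are not known here.
ALLOWED_COLORS = {"green", "yellow", "red"}


class StatusUpdate(NamedTuple):
    system: str
    color: str
    reason: str


def parse_status_command(text):
    """Parse '/status <system> <color> <reason for status>' into a record."""
    # maxsplit=3 keeps the free-text reason together as one field.
    parts = text.strip().split(maxsplit=3)
    if len(parts) != 4 or parts[0] != "/status":
        raise ValueError("expected: /status <system> <color> <reason>")
    _, system, color, reason = parts
    color = color.lower()
    if color not in ALLOWED_COLORS:
        raise ValueError(f"unknown color: {color!r}")
    return StatusUpdate(system, color, reason)


update = parse_status_command("/status actions yellow Queue delays for hosted runners")
```

The free-text reason is the important part: it is the sentence about what is actually affected that an automated check can't easily produce.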
Is it time to use a self-hosted backup like what GNOME is using? [1]
Edit: it's back.