Tell HN: GitHub is down (Update: Back online now)

188 points · slightknack · 4y ago · 64 comments

Getting 500 errors, multiple friends confirming: https://github.com

64 comments

nyellin 4y ago

It's like the good old days when the power went down at work and everyone came to the hallways.

Good luck to the engineers at GitHub as I know how stressful it can be, but hope everyone else is enjoying a nice break and some socializing.

lnsp 4y ago

According to GH Status (https://www.githubstatus.com/), everything is fine. Gotta love functional status pages.

edit: Never mind, they just reported "degraded performance" for GitHub Actions, Issues, and Pull Requests.

rossmohax 4y ago

Didn't they have actual error rate graphs on that page back in the day?

speedgoose 4y ago

Yes, and some response times too. It was actually useful.

sc90 4y ago

They have updated it.

agomez314 4y ago

I wonder how they even work? Like, do they just display a green CSS button instead of _actually_ doing a healthcheck?

aseipp 4y ago

Yes they do, because very rarely would such a healthcheck kind of setup actually work in practice, at a large enough size, for a user-facing dashboard. If you want a healthcheck, look at a Grafana dashboard, not a status page.

By the way, I don't know of a single place where this isn't the case, where a human signs off on and updates the status page during large events (at least at the final decision.) Some of it will be automated, sure, like red flags being raised to operators. But at a certain point it is not possible to automate this at some level to achieve second-level accuracy or whatever; the system is rarely (if ever) in a binary state of "working perfectly" or "not working", but somewhere in between. You can't just fire off a big red error bar every time a blip occurs at a place like GitHub. The system is constantly "in motion". The logical conclusion is to just expose your 50+ Grafana dashboards publicly to every user. Isn't that the most honest "overview" of what is happening with your product? Except this often can't tell them useful things either.

People on here will also mumble about SLAs but if a customer wants a kickback or is seriously worried about events like these, they're generally talking to account managers, not posting on internet forums. That said, a lot of them get weaselly about that stuff unless you're already negotiating prices with an AM in the first place...
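The "blip" problem described above can be sketched in a few lines. This is only an illustration, not how GitHub or any other provider does it: the class name, the thresholds, and the window size are all made up, and a real system would look at far more than one error-rate number.

```python
from collections import deque


class StatusClassifier:
    """Maps recent error rates to a coarse status, ignoring short blips.

    Hypothetical sketch: thresholds and window size are arbitrary assumptions.
    """

    def __init__(self, window=5, degraded_at=0.05, outage_at=0.50):
        self.samples = deque(maxlen=window)  # most recent error-rate samples
        self.degraded_at = degraded_at
        self.outage_at = outage_at

    def observe(self, error_rate):
        """Record one error-rate sample (0.0 to 1.0) and return the status."""
        self.samples.append(error_rate)
        return self.status()

    def status(self):
        if len(self.samples) < self.samples.maxlen:
            return "operational"  # not enough data to call anything yet
        # Every sample in the window must breach a threshold, so a single
        # bad sample (a blip) never flips the public status on its own.
        lowest = min(self.samples)
        if lowest >= self.outage_at:
            return "major outage"
        if lowest >= self.degraded_at:
            return "degraded"
        return "operational"
```

Even this toy version shows the trade-off: a larger window suppresses false alarms but delays the red bar by exactly that many samples, which is part of why a human ends up making the final call.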

lamontcg 4y ago

When I started work at Amazon in 2001 we had a "gonefishing" page for outages that a human had to flip the site to manually.

We actually stopped doing this a year or two later because reporters were setting up monitoring for that page showing up and were reporting outage statistics based on it. So we just left the site up in whatever degraded state it was in, and that made the problem of measuring www.amazon.com uptime externally that little bit more difficult.

drstewart 4y ago

Probably requires manual updates. It seems like more and more places have moved to this paradigm now that status pages are tied to SLAs which are tied to money. One might call it the politicization of status pages.

deathanatos 4y ago

Politicization, yes; I've never heard of SLAs being tied to status pages. It is like pulling teeth to get most cloud providers to credit the account when they don't meet SLA, and one always has to ask for it; heaven forbid if credits were paid out automatically when service wasn't rendered.

Or you get weasel-worded out of it. I had a cloud provider deny a service credit; the SLA stated that the service was only out of SLA if it didn't return 2xx. Well, the API returned "2xx Accepted — your request is being processed", and you could use the API to query the job, and the job … never finished or made any progress at all. But the API returned 2xx the entire time, so that was "within SLA".
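To make the loophole concrete, here is a minimal sketch contrasting a "did it return 2xx" check with a check on whether the job actually moves. Both function names and the percent-complete readings are hypothetical; the provider's real API is not described in the comment above.

```python
def within_naive_sla(status_code):
    """The SLA wording from the anecdote: any 2xx response counts as 'up'."""
    return 200 <= status_code < 300


def job_is_making_progress(progress_samples, min_delta=1):
    """Check that a job advanced between the first and last poll.

    progress_samples: percent-complete readings, oldest first (an assumed
    shape for whatever the job-status endpoint returns).
    """
    if len(progress_samples) < 2:
        return False  # a single reading cannot show movement
    return progress_samples[-1] - progress_samples[0] >= min_delta
```

With these, a stuck job polled three times gives `within_naive_sla(202)` as `True` while `job_is_making_progress([0, 0, 0])` is `False`, which is exactly the gap the anecdote fell into.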

asciimike 4y ago

Correct. SLAs aren't calculated off status pages, there are far better ways of calculating it (running a query over responses, for example). Most modern SLAs are customer initiated anyways, so the customer is writing in to request this rather than automatically calculating them. The status page doesn't need to show anything for a customer to provide logs indicating a QoS less than that promised in the SLA.

I don't think it's politics (maybe AWS's is, but GCP wasn't IMO), it's really a function of "in large scale software systems things are constantly failing in all sorts of ways, and it's really hard to output a meaningful automated signal that things are broken." Sure, you can set up Pingdom-type health checks on every endpoint, but even then you're not necessarily guaranteeing that things are working properly.

Source: worked at a few cloud providers, paid out a few SLA violations
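"Running a query over responses" could look roughly like the following. It is a sketch under assumptions: the input is just a list of HTTP status codes from a customer's own logs, only 5xx counts as a failure, and the 99.9% target is a placeholder rather than any provider's actual SLA.

```python
def availability(status_codes):
    """Fraction of requests that did not fail with a server error (5xx)."""
    if not status_codes:
        return None  # no traffic, nothing to measure
    ok = sum(1 for code in status_codes if code < 500)
    return ok / len(status_codes)


def sla_violated(status_codes, promised=0.999):
    """True if measured availability fell below the promised level."""
    measured = availability(status_codes)
    return measured is not None and measured < promised
```

For example, `availability([200, 200, 500, 404])` is `0.75`, since the 404 is the client's problem and only the 500 counts against the service.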

mkl95 4y ago

AWS do the same thing. Pretty sure someone updates it manually.

romellem 4y ago

Incident is up now - https://www.githubstatus.com/incidents/fz1bdbw24y81

> We are investigating reports of degraded performance for GitHub Actions, Issues, and Pull Requests.

xtracto 4y ago

Degraded Performance?

Their freaky homepage is borked: https://github.com/ yields a 500 error haha.

AdamJacobMuller 4y ago

100% degraded is degraded.

rossmohax 4y ago

Negative growth rate is still growth :)

TremendousJudge 4y ago

That's a funny way of saying "every page is returning 500"

drewbug01 4y ago

IIRC, that's the default verbiage that goes out when someone pulls the "oh shit" lever.

asciimike 4y ago

github.com is having issues across the site; the API, including git operations (and the CLI), still works. Status page is manually updated, and we're working to get it updated.

EDIT: it's updated now.

EDIT EDIT: github.com is back up and running, apologies for the disruption :(

Source: GitHub employee

Miner49er 4y ago

Why manually update vs automatically?

kxrm 4y ago

Not many places I have worked for allowed for this to be automatic. A lot of it was so they could provide a coherent explanation as to what the current state of internal attention was directed at vs what everyone can plainly see.

Miner49er 4y ago

I guess this makes sense, but I don't understand why you couldn't have it change status automatically, and still allow a person to go in after and manually add an explanation.
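The split being asked about, automation flips the state and a person adds the explanation afterwards, might be modeled like this. It is purely a hypothetical sketch; `ComponentStatus` and its fields are invented for illustration and do not describe any real status page product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ComponentStatus:
    name: str
    state: str = "operational"   # flipped automatically by monitoring
    note: Optional[str] = None   # free-text explanation added by a human

    def auto_update(self, state: str) -> None:
        """Called by automation; never blocks on a human."""
        if state != self.state:
            self.state = state
            self.note = None  # a stale explanation should not outlive its state

    def annotate(self, note: str) -> None:
        """Called by an operator once they know what is going on."""
        self.note = note

    def render(self) -> str:
        if self.note:
            return f"{self.name}: {self.state} - {self.note}"
        return f"{self.name}: {self.state}"
```

The data model is the easy part; the replies around this comment suggest the hard part is deciding when automation should be trusted to flip the state at all.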

jondwillis 4y ago

They wait for the HN thread to appear so they don't give false positives.

JaimeThompson 4y ago

I think it is so they can "hide" small outages that don't rise to the level of making the news sites so they can look better. A lot of sites do this sort of thing these days.

asciimike 4y ago

Ish? I wrote some stuff up here: https://news.ycombinator.com/item?id=30182591

When I worked at GCP it was all manually updated as well so we could add a sentence about what was actually affected. In any sufficiently large system it's hard to indicate exactly what's broken/how to work around it, so it was just easier to `/status <system> <color> <reason for status>`.
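A command of that shape is simple to parse. The sketch below only assumes the `/status <system> <color> <reason for status>` layout quoted above; the function name, the allowed colors, and the returned dict are guesses for illustration, not the actual internal tool.

```python
ALLOWED_COLORS = {"green", "yellow", "red"}  # assumed palette


def parse_status_command(text):
    """Split '/status <system> <color> <reason...>' into its three fields."""
    parts = text.strip().split(maxsplit=3)
    if len(parts) < 4 or parts[0] != "/status":
        raise ValueError("expected: /status <system> <color> <reason>")
    _, system, color, reason = parts
    if color not in ALLOWED_COLORS:
        raise ValueError(f"unknown color: {color}")
    # The reason keeps its internal spaces because of maxsplit=3.
    return {"system": system, "color": color, "reason": reason}
```

The free-text reason is the piece automation cannot supply, which matches the point about needing a sentence on what was actually affected.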

rvz 4y ago

Oh dear. Last time a serious incident happened was just 48 hours ago: [0] Now it has gotten critically worse.

Is it time to use a self-hosted backup like what GNOME is using? [1]

[0] https://news.ycombinator.com/item?id=30149071

[1] https://gitlab.gnome.org/GNOME

mkl95 4y ago

Yup. I've been disrupted by GitHub today, and DockerHub the other day. Crude reminder that the cloud is some company's computer.

xwdv 4y ago

My own computer wouldn’t be much better.

hermitdev 4y ago

Same. My desktop won't even currently power on after a move... Looks like I get to reseat all of my RAM tonight :)

leesalminen 4y ago

I just refreshed a page from GH that I've had open since last night, and yup, 500. Of course I came to HN first before even visiting their own status page as HN always has an update faster than their official page.

sp1rit 4y ago

Yes, down in Europe too :/

https://downforeveryoneorjustme.com/github

codingkev 4y ago

Yep, down for ~10 mins here in Central Europe.

rsyring 4y ago

I'd like to see a graphic of the traffic spike at times like this from CTRL-R. They even encourage it on the 500 page: "try refreshing".

Edit: it's back.

longnguyen 4y ago

Looks like only the website is down. I just used the mobile app and it works fine.

mman0114 4y ago

Yup, confirmed here as well. Multiple team members also having issues.

kitkat_new 4y ago

Too bad federated GitLab still does not exist.

joshstrange 4y ago

I am able to push code but then I went to check the GH Action to make sure it ran and.... error pages.

lijogdfljk 4y ago

Yup. Not even the unauthenticated home page is showing up. Wew

pedrodelfino 4y ago

It was down for me too (in Brazil). But it is back now! :)

MatthiasPortzel 4y ago

It looks like it's back up now, can anyone confirm?

caterama 4y ago

Their 500 page is delightful... More downtime, I say!

dvdhnt 4y ago

Right as we were doing a big release, too :)

filippofinke 4y ago

It's currently working in Switzerland.

oneil 4y ago

Literally just posted a thread.. Doh! :(

gregmfoster 4y ago

graphite.dev (alternative frontend to github code review) still up, it appears to be just their frontend.

karussell 4y ago

Works again for me here in Germany.

sambhu 4y ago

It's up and running now

sc90 4y ago

Yes, returning a 500.

daveed 4y ago

Looks down to me :(

zaphod4prez 4y ago

Yep, having issues

sctgrhm 4y ago

Same here

devalnor 4y ago

down for me too

AdamJacobMuller 4y ago

down for me

russian_bot 4y ago

can corroborate, very down

tgymnich 4y ago

same
