Amazon could learn a thing or two from Github in terms of understanding customer expectations.
I do the same thing, often searching Twitter for "aws" or "outage" and finding people complaining about the problem, which confirms my suspicions. It's a sad state of affairs when you have to do this, and Amazon doesn't seem interested in fixing it.
Both Linode and Amazon suck at their status pages (though Linode was quite informative about their DDoS outages that started on Christmas). With every Amazon issue we've had, the status page only changed once they'd more or less fixed it. As far as I'm concerned their status page is basically useless unless it's an extended outage, at which point it's still basically useless...
Do you mean that "the cloud provider that is bigger than the next 14 combined and whose jargon has spread through the community" doesn't understand what customers are interested in and delivering on that?
The bigger a project gets, the lower the priority something like a status page often seems to get. Larger entities certainly _have_ them, but I often see more things interfering as scale grows (this isn't only an MS thing, let me make clear): domain hand-offs between engineering and social management (status is often via Twitter), feeding the status page from a long telemetry/monitoring pipeline that has some lag, or a high threshold for what "outage" means to avoid flappy notices (at the cost of some false negatives).
I'm not even going to make a value judgement on the tradeoff of these costs at this point (I certainly wouldn't dismiss it offhand as a net negative, although equally it's not all roses), but at the very least I'd observe that something like a status page _CAN_ be serviced very well by an up-and-comer (for as much as GitHub is still that), and it's far from true that bigcos can't take learnings about improving customer happiness from newer entities. (In fact, I wish that were a more common practice!)
This is also hidden by the fact that Redis is really reliable (in my experience at least). In my experience it usually takes an ops event (like adding more RAM to the redis machine) to realize where a crutch has been developed on Redis in critical paths.
Redis Sentinel[0] has been the HA solution for Redis for quite some time.
Of course, Sentinel does not make Redis conceptually different from what it is, from the point of view of consistency guarantees during failures. It performs a best-effort attempt to select the best slave to retain writes, but under certain failure modes it's possible to lose writes during a failover.
This is common with many failover solutions for *SQL systems as well, btw. Whether this is an affordable risk depends on your use case. For most Redis use cases, the risk of losing some writes after certain failovers is usually not a big issue. For other use cases it is, and then a store that retains writes across all failure scenarios should be used.
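The failure mode can be sketched with a toy simulation (plain Python, no real Redis involved — this is an illustration of asynchronous replication in general, not of exact Redis/Sentinel semantics): the master acknowledges writes before they reach the replica, so promoting a lagging replica silently drops the tail of acknowledged writes.

```python
# Toy model of async replication showing why failover can lose
# acknowledged writes. Illustration only; real Redis is more involved.

class Node:
    def __init__(self):
        self.data = {}

def replicate(master, replica, lag):
    """Copy all but the last `lag` writes to the replica."""
    items = list(master.data.items())
    for key, value in items[:len(items) - lag]:
        replica.data[key] = value

master, replica = Node(), Node()

# Client writes are acknowledged as soon as the master applies them.
for i in range(10):
    master.data[f"key:{i}"] = i      # ack returned to the client here

replicate(master, replica, lag=3)    # replication is async: 3 writes behind

# Master crashes; the (lagging) replica is promoted.
master = replica

# The last 3 acknowledged writes are gone.
lost = [f"key:{i}" for i in range(10) if f"key:{i}" not in master.data]
print(lost)  # ['key:7', 'key:8', 'key:9']
```

The window of loss is the replication lag at the moment the master dies, which is exactly what a "best-effort" slave selection tries to minimize but cannot eliminate.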
You can replicate to read-only instances in a secondary DC and failover. It hurts but it is better than an outage imo.
Redis has been demonstrated[0][1] to lose data under network partitions. This is particularly concerning when discussing the type of partial failure that GitHub reported.
sometimes reading comments on hn makes me laugh out loud.
there's only one reason to not do this, and that's cost. what do you expect them to say about that? i mean really, you think they're going to put that in a blog post:
"Well, the reason we don't have an entire replica of our entire installation is because it costs way too much. In fact, more than double! And so far our uptime is actually 99.99% so there's no way it's worth it! You can forget about that spend! Sorry bros."
So, a very small risk of an hour or so of downtime sometime in the future which will not cause data loss, or tens of thousands of dollars a month for a failover cluster? I wouldn't replicate it either.
If an outage caused 2 hours of read-only access to repos it would still be moderately impactful, but at least we could still build our Go code.
We should collectively be using incidents like this as an opportunity to learn, much like the GitHub team does. Our entire industry is held back by the lack of knowledge sharing when it comes to problem response and the fact that so many companies are terrified of being transparent in the face of failure.
This is very well written retrospective that gives us a glimpse into the internal review that they conducted. Imagine how much we could collectively learn if everyone was fearless about sharing.
That's the first 5 minutes after getting to a computer.
After that it doesn't really matter why they're down. You failover, get the site back up and worry about it later.
Are these systems on a SAN? That's probably the first mistake if so. Redis isn't HA. You're not going to bounce its block devices over to another server in the event of a failure. That's just a complex, very expensive strategy that introduces a lot of novel ways to shoot yourself in the face. If you're hosting at your own data-center, you use DAS with Redis. Cheaper, simpler. I've never seen an issue where a cabinet power loss caused a JBOD failure (I'm sure it happens, but it's far from a common scenario IME), but then again, locality matters. Don't get overly clever and spread logical systems across cabinets just because you can.
Being involved with this sort of thing more frequently than I'd like to admit, I don't know the exact situation here, but 2h6m isn't necessarily anything to brag about without a lot more context.
What's pretty shameful is that a company with GitHub's resources isn't drilling failover procedures, is ignoring physical segmentation as an availability target (or maybe just got really really unlucky; stuff happens), and doesn't have a backup data-center with BGP or DNS failover. This is all stuff that (in theory if not always in practice), many of their clients wearing a "PCI Compliant" badge are already doing on their own systems.
You bet they busted their ass to get this fixed and shared their learnings with us. I'm extremely grateful for this and yeah it inconvenienced my morning but nothing more.
You make it sound so easy. If it takes the Github folks 2 hours, I can bet it would've taken us much longer.
So thank you GitHub, please keep up the good work!
But wow it is refreshing to hear a company take full responsibility and own up to a mistake/failure and apologize for it.
Like people, all companies will make mistakes and have momentary problems. It's normal. So own up to it and learn how to avoid the mistake in the future.
as an aside, I feel quite fortunate to work in the EST timezone, as their outage apparently started at about 7pm my time. We have a general rule at my company not to deploy after 6pm unless an emergency fix absolutely needs to go up.
I saw the title of the story and said to myself, what outage? :P
Complex systems fail. Period. All the time. Things like the Simian Army are fantastic tools that help you identify a host of problems and remediate them in advance, but they cannot test every combinatorial possibility in a complex distributed system.
At the end of the day, the best defense is to have skilled people who are practiced at responding to problems. GitHub has those in spades, which is why they could respond to a widespread failure of their physical layer in just over 2 hours.
The biggest win with the Simian Army isn't that it improves your redundancy. It's that it gives your people opportunities to _practice_ responses.
I'm really tempted to continue with "a simbian army is actually" but this isn't Reddit so end of comment thread.
Including the principle that if your software breaks, you're the one who has to go get savaged by velociraptors to fix it.
If only things were that easy.
For all the fawning over being provided technical details, this article was pretty light on them.
I don't think Github going down for a couple hours is that big of a deal TBH. But it does seem to expose a few really basic failings in their DR planning IMO.
I also think it's ridiculous that some commenters are trying to frame this as a distributed computing problem. It's not even a clustering problem (apparently). It's just looking at the iDRAC or whatever to see why the server isn't getting past POST and putting your recovery plan into action.
This is white box vanilla stuff that happens to everybody.
That servers had to be rebuilt as part of DR says a lot.
The fact that there was a Redis dependency during bootstrap? Probably a good thing. You know as well as anyone, I'm sure, that the last thing you want is a bunch of processes that only look like they're up. And even if they could come up without erroring on their missing Redis connections, if Redis is used for caching, what's that going to do to availability? Would it be a good thing to have the processes up if they can only handle 10% of the usual load?
Those are details that aren't there.
But complex distributed computing problem this is not. Not as it was presented anyways.
Usually it's just cheaper to be down for an hour or two than to architect for the end of times.
That's an awesome idea. I wish all companies published the firmware releases in simple rss feeds, so everyone could easily integrate them with their trackers.
(If someone's bored, that may be a nice service actually ;) )
I'm getting flashbacks. All of the servers in the DC reboot and NONE of them come online. No network or anything. Even remotely rebooting them again, we had nothing. Finally getting a screen (which is a pain in itself), we saw they were all stuck on a grub screen. Grub had detected an error and decided not to boot automatically. Needless to say, we patched grub and removed this "feature" promptly!
Tell us which vendor shipped that firmware, so everyone else can stop buying from them.
So you want GitHub to open source where they put your git repo and issues? Who cares about that? It's unimportant because regardless they're still the central endpoint for many projects, open or closed source. If you want open source, use GitLab or any other service that sprinkles extra features around git.
I'll never understand this outrage of dependence on Github when you have a distributed version control system. It's not like it should be on github to setup third party repositories for you.
Ofc, Github isn't to blame for this, rather the ones that thought Github would be great to use as a CDN.
When I worry about dependency on GitHub, I'm thinking about not the inconvenient hours of downtime but the larger threat that they might disappear or turn evil.
What I would like to see even more than opensource github would be a standard for spreading over more services. For instance, syncing code, issues, pull requests, wiki, pages, etc between self-hosted gitlab and gitlab.com, or between gitlab.com and github.com. Further, I'd like to see it be easier to use common logins across services.
I don't think we can rely on Github giving us this, but if GitLab would add it between gitlab.com and gitlab ce, that would be a compelling reason to think of switching.
I don't think outages at GitHub are very frequent. This one was lengthy, so it's definitely been on a lot of peoples' minds, but this conversation always comes up when it happens.
I don't think outages at GitHub are very frequent
And yet, some of the entitlement around this outage is incredible. It's as though a community's desire to see GitHub online is far more relevant than the lack of SLAs and thousand-dollar service fees. What would you rather have? A dependency on a bunch of projects with variable hosting of whatever means, or all your dependencies hosted with the uptime of GitHub? Having an install fail because some host is down somewhere deep in your nest of dependencies is going to happen a lot more if you have more hosts to worry about.
(unless I missed some specific non capacity related outages?)
I was an active BB user a couple years ago, and the project I worked on would hg clone from BB many times a day, so I would be the first one to notice a 503 or whatever error coming from their service. Typically I would see one or two outages per month, some lasting a few minutes, some lasting several hours. Most of the time the outage impacted git/hg checkout, so I think that was their technical bottleneck.
RFO: A squirrel climbed into a transformer
and a short time later they both blew up. Once it involved fire alarms, which trigger safety shutdowns within a suite. The other involved a failed static switch panel - i.e., the things that aren't meant to be able to fail.
I don't mean to be blasphemous, but from a high level, are the performance issues with Ruby (and Rails) that necessitate close binding with Redis (i.e., lots of caching) part of the issue?
It sounds like the fundamental issue is not Ruby, nor Redis, but the close coupling between them. That's sort of interesting.
It has nothing to do with Ruby, or Rails or even Redis. It's just a design flaw of the application, that you often learn the hard way.
I believe the fundamental issue was just that redis availability was taken for granted by app servers so that certain code paths/requests would fail if it wasn't available, rather than merely be slower.
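A hypothetical sketch of treating the cache as optional rather than a hard dependency — all names here are invented, and a stub stands in for a real Redis client:

```python
import logging

class CacheUnavailable(Exception):
    pass

class SoftCache:
    """Wraps a cache client so that cache failures degrade to a cache
    miss instead of failing the whole request. Sketch only; in a real
    app `client` might be a redis-py connection and the exception a
    redis ConnectionError."""

    def __init__(self, client):
        self.client = client

    def get(self, key):
        try:
            return self.client.get(key)
        except CacheUnavailable:
            logging.warning("cache down, treating %r as a miss", key)
            return None   # caller falls back to the primary store

# Stub client simulating an unreachable cache.
class DownClient:
    def get(self, key):
        raise CacheUnavailable

def fetch_user(cache, user_id, db):
    cached = cache.get(user_id)
    if cached is not None:
        return cached
    return db[user_id]    # slower path, but the request still succeeds

db = {42: "alice"}
print(fetch_user(SoftCache(DownClient()), 42, db))  # alice
```

With a wrapper like this the app survives a Redis outage at the cost of latency, which is the "merely be slower" behavior the parent describes.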
That would have given them immediate context instead of wasting time on DDoS protection.
Flea power?
Since the firmware had a bug, bad state could be stored; completely removing power may clear that state, and appears to have done so in this case. They may have also needed to pull the backup battery and reset the firmware settings, but I wouldn't presume that just from the term "flea power."
I have never known what to call this, but have definitely been engaged in draining a few fleas.
Also, I can't believe it's been that long since google answers has been closed..
This doesn't sound very good.
I was recently involved in an outage that occurred because the same datacenter was hit by lightning three times in a row. Everything was redundant up the wazoo and handled the first two hits just fine, but by the time the power went out for the third time within N minutes, there wasn't enough juice left in some of the batteries!
Now, would it be possible to build an automated system that can withstand this? Probably. But would your time & money be better spent worrying about other failure modes? Almost certainly.
It is true that their entire statement is about recovery; however, it is disappointing that they didn't mention anything about a redundant datacenter.
"...but we can take steps to ensure recovery occurs in a fast and reliable manner. We can also take steps to mitigate the negative impact of these events on our users."
The lessons that giants like Netflix have learned about running massive distributed applications show that you cannot avoid failure, and instead must plan for it.
Now, having a single datacenter is not a good plan if you want to give any sort of uptime guarantee, but that's a different point to make.
However, their recovery report didn't mention anything about such a plan.
<< Edited: correct a grammar error.
A failed/failing drive present during cold boot could cause the controller to believe there were no drives present. To add insult to injury, on early BIOS versions this made the UEFI interface inaccessible. The only way to recover from this state was to re-seat the RAID controller.
There were also two bizarre cases where the operating system SSD RAID1 would be wiped and replaced with a NTFS partition after upgrading the controller firmware (and more) on an affected system (hanging/flapping drives). Attempts to enter UEFI caused a fatal crash, but reinstall (over PXE) worked fine. BIOS upgrade from within fresh install restored it.
From the changelog:
Fixes:
- Decreased latency impact for passthrough commands on SATA disks
- Improved error handling for iDRAC / CEM storage functions
- Usability improvements for CTRL-R and HII utilities
- Resolved several cases where foreign drives could not be imported
- Resolved several issues where the presence of failed drives could lead to controller hangs
- Resolved issues with managing controllers in HBA mode from iDRAC / CEM
- Resolved issues with displayed Virtual Disk and Non-RAID Drive counts in BIOS boot mode
- Corrected issue with tape media on H330 where tape was not being treated as sequential device
- Resolved an issue where inserted hard drives might not get detected properly

I seem to recall a recent post on here about how you shouldn't have such hard dependencies. It's good advice.
Incidentally, this type of dependency is unlikely to happen if you have a shared-nothing model (like PHP has, for instance), because in such a system each request is isolated and tries to connect on its own.
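The difference can be sketched in a few lines of plain Python (a toy `Service` stands in for Redis; no real networking): a worker that grabs its connection once at boot is crippled forever if the dependency was down at boot, while a shared-nothing-style worker that connects per request recovers as soon as the dependency comes back.

```python
# Toy contrast between connect-at-boot and connect-per-request.

class Service:
    def __init__(self):
        self.up = False
    def connect(self):
        if not self.up:
            raise ConnectionError("service down")
        return "conn"

service = Service()          # dependency is down while the app boots

# Connect-at-boot worker: caches a dead connection (None) forever.
class BootWorker:
    def __init__(self, svc):
        try:
            self.conn = svc.connect()
        except ConnectionError:
            self.conn = None # boot "succeeds" but the worker is crippled
    def handle(self):
        return "ok" if self.conn else "error"

# Shared-nothing style: each request tries to connect on its own.
class PerRequestWorker:
    def __init__(self, svc):
        self.svc = svc
    def handle(self):
        try:
            self.svc.connect()
            return "ok"
        except ConnectionError:
            return "error"

boot, per_req = BootWorker(service), PerRequestWorker(service)
service.up = True            # the dependency recovers

print(boot.handle(), per_req.handle())  # error ok
```

Per-request connection setup costs latency, which is why long-lived processes pool connections; the sketch just shows why that optimization turns a transient dependency outage into a persistent one unless connections are retried.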
The thing that fixed the last problem doesn't always fix the current problem.
There were other learning points as well, such as immediately going into anti-DDoS mode, and human communication issues that meant the problem wasn't recognised or escalated until some time after the issues started occurring.
Firmware issue meant that a large fraction of their servers could not detect the disks on reboot.
This prevented the redis cluster from starting.
They inadvertently have a hard-dependency on redis being up for the majority of their infrastructure to start.
Takeaway: If you run any complex system, ensure that each component is tested for its response to various degrees of failure in peer services, including but not limited to totally unavailable, intermittent connectivity, reduced bandwidth, lossy links, power-cycling peers.
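One cheap way to exercise those failure responses continuously is to inject them in unit tests. A sketch using only Python's standard library — the `Inventory`/`get_stock` names are invented for illustration:

```python
import socket
from unittest import mock

class Inventory:
    """Hypothetical client for a peer service, with a degraded-mode
    fallback instead of a hard dependency."""
    def __init__(self, backend):
        self.backend = backend

    def get_stock(self, sku):
        try:
            return self.backend.fetch(sku)
        except (socket.timeout, ConnectionError):
            return None  # degrade instead of failing the whole request

# Inject the failure modes a real peer can exhibit:

# 1. Peer accepts the connection but hangs until timeout.
flaky = mock.Mock()
flaky.fetch.side_effect = socket.timeout("peer hung")
assert Inventory(flaky).get_stock("sku-1") is None

# 2. Peer is completely down.
refused = mock.Mock()
refused.fetch.side_effect = ConnectionRefusedError()
assert Inventory(refused).get_stock("sku-1") is None

print("degraded cleanly under both failure modes")
```

Mocks only cover the failure modes you remember to enumerate, which is where chaos-style injection against real staging environments complements them.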
No CI/test process was in place for hardware/firmware combos to ensure they recovered fine from power loss.
Takeaway: If you run a decent-sized cluster, ensure all new hardware ingested is tested through various power state transitions multiple times, and again after firmware updates. With software defined networking now the norm, we have little excuse not to put a machine through its paces on an automated basis before accepting it to run critical infrastructure.
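Such an intake harness can be a small loop: power-cycle the box N times and reject it if it ever fails to come back within a boot timeout. A hedged sketch — `power_cycle` and `is_up` are injected callables that in a real harness might wrap `ipmitool chassis power cycle` and an SSH/ping check; here fakes simulate a firmware bug that loses the disks on the second cycle:

```python
import time

def burn_in(host, power_cycle, is_up, cycles=3, boot_timeout=300, poll=1):
    """Reject a machine that doesn't survive repeated power cycles.
    Sketch only; timeouts and cycle counts are arbitrary."""
    for n in range(cycles):
        power_cycle(host)
        deadline = time.monotonic() + boot_timeout
        while not is_up(host):
            if time.monotonic() > deadline:
                return False   # fail intake: host didn't survive cycle n
            time.sleep(poll)
    return True                # accept host into the fleet

# Fake host that (like the firmware bug) never boots after cycle 2.
state = {"cycles": 0}
def fake_cycle(host):
    state["cycles"] += 1
def fake_is_up(host):
    return state["cycles"] < 2

print(burn_in("node-01", fake_cycle, fake_is_up,
              cycles=3, boot_timeout=0.05, poll=0.01))  # False
```

Re-running the same harness after every firmware update is the part that would have caught this particular bug before it reached production.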
No CI/test process was in place for status advisory processes to ensure they were sufficiently rapid, representative, and automated.
Takeaway: Test your status update processes as you would test any other component service. If humans are involved, drill them regularly.
Infrastructure was too dependent on a single data center.
Takeaway: Analyze worst case failure modes, which are usually entire-site and power, networking or security related. Where possible, never depend on a single site. (At a more abstract level of business, this extends to legal jurisdictions). Don't believe the promises of third party service providers (SLAs).
PS. I am available for consulting, and not expensive.
Edit: this is mostly the "DR" part of tl;dr :P
You're welcome.
I would STFW, but searching for "HA" isn't helpful.
While this was happening at GitHub, I noticed several other companies facing the same issue at the same time. Atlassian was down for the most part. It could have been an issue with a service GitHub uses, but they won't admit that. Notice they never said what the firmware issue was, instead blaming it on "hardware".
I think they should be transparent with people about such vulnerability, but I suspect they would never say so because then they would lose revenue.
Here on my blog I talked about this issue: http://julesjaypaulynice.com/simple-server-malicious-attacks...
I think it was some ddos campaign going on over the web.