As I assumed, it was kind of a corner-case bug meets corner-case bug meets corner-case bug.
This is also why I am afraid of self-driving cars and other such life-critical software. There are going to be weird edge cases; what prevents you from reaching them?
Making software is hard....
The real question is whether society can handle the unfairness of death by random software error vs. death by negligent driving. It's easy to blame negligent driving on the driver; we're clearly not negligent, so it really doesn't affect us, right? But a software error might as well be an act of god; it's something that might actually happen to me!
[1]: https://en.wikipedia.org/wiki/List_of_motor_vehicle_deaths_i...
At least with human drivers, the failures are generally uncorrelated.
Most people have a greater fear of flying than of driving, although statistically you're far more at risk in a car. One cause of that fear of flying is loss of control: you have to accept placing your life in someone else's hands.
With self-driving cars, I suspect lack of control will also be a problem. Either we need to provide passengers with some vestige of control to keep them busy, or we just wait a generation until people get used to it.
I'm very curious to see if our understanding (as a society) of our own technology will improve over time or if people will continue to blame the internet for "not working" 20 years from now.
We see this ALL the time with ALL the big companies including the ones I have worked for in the past. I am very interested in possible solutions people are cooking up here.
When an accident is inevitable, software will decide whether private or public property should be prioritized, which action is more likely to protect driver/passenger A to the detriment of driver/passenger B, etc.
Most people wouldn't blame the outcome of a split-second decision made in the heat of the moment, but would take issue when the action is deliberate.
Interesting times we live in
But how is Google or any other manufacturer going to test their software updates? Are they going to test-drive their cars for tens of thousands of miles over and over again for every little update?
Outages suck, but are inevitable even for Google. With a response like this Google has gained even more trust from me.
Pair this with the outage tracking tools and you can find all the outages that have happened across Google and what caused them.
Then there is DiRT[0] testing to try to catch problems in a controlled manner. Having things break randomly throughout Google's infrastructure, and having to see whether your service's setup and oncall people handle it properly, is a really awesome exercise.
[0] http://queue.acm.org/detail.cfm?id=2371516
The opinions stated here are my own, not necessarily those of Google.
Edit: Changed from saying "all" to "most" postmortems being available to Googlers to see.
With humans, the amount of knowledge gained and the collective improvement of driving behavior from a single accident is low, and each accident mostly provides some data points to tracked statistics. With machines, great systematic improvements are made possible over time such that the remaining edge cases will become increasingly improbable.
I'll have to point out that this is necessary, but not sufficient, for enabling an ever-improving, extremely safe activity.
Aviation also has just the right amount of blame running through the system, which is hard to replicate in any other area.
It's estimated that self-driving cars could reduce vehicle crashes by approximately 90%! [4]
[1] http://www-nrd.nhtsa.dot.gov/pubs/812115.pdf [2] http://www-nrd.nhtsa.dot.gov/Pubs/812219.pdf [3] http://www.who.int/violence_injury_prevention/road_safety_st... [4] http://www.mckinsey.com/industries/automotive-and-assembly/o...
People suck at driving. Even a shitty self-driving car will save a ton of lives simply by obeying traffic laws.
Personally, I think automated cars are going to easily be better than humans in the working cases (both human and AI are conscious). Next, I expect to see fully operational backup systems.
E.g., if a monitoring system decides that the primary system is failing for whatever reason, be it a bug or an unhandled road condition (a tree, etc.), the backup system takes over and solely attempts to get the driver off the road and into a safe location.
Humans often fail, but can often attempt to recover. And, as bad as we may be even at recovering, we know to try to avoid oncoming traffic. Computers (currently) are very bad at recovering when they fail. I feel like having a computer driving, in the event of failure, is akin to a narcoleptic driver: when it goes wrong, it goes really wrong. Hence I hope to see a backup system, completely isolated, and fully intent on safely changing lanes or finding a suitable area to pull over.
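A backup takeover like that can be sketched as a simple watchdog decision. This is purely illustrative; the function name, the heartbeat representation, and the timeout value are all invented for the sketch:

```python
def choose_action(now, last_heartbeat, heartbeat_timeout=0.5):
    """Decide which controller should act this tick.

    While the primary controller's heartbeats are fresh it stays in
    control; the moment it goes silent past the timeout, the isolated
    backup takes over with the single goal of pulling over safely.
    """
    if now - last_heartbeat > heartbeat_timeout:
        return "backup"   # primary presumed failed: execute the pull-over plan
    return "primary"
```

The key design choice is that the backup path is decided by elapsed time alone, with no dependence on the (possibly wedged) primary reporting its own health.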
If you dispute this, please explain.
If your position is that one death is too many, that is illogical relative to the option of letting people drive cars.
I'm not saying it won't get better, but pretending self-driving cars is a cure-all right now is hilarious and insane.
I wouldn't really agree with that. There were two pieces of code designed to perform checks on new configs and cancel them. They both failed. Neither of those checks is a corner case. If you had a spec sheet for the system that manages IP blocks, that functionality would be listed as a feature right up front.
Sounds to me like someone just didn't bother to test the failsafe part of the code.
How many people drive aggressively, speeding, or erratically? How many people do dumb things on the road?
As a software engineer I know that there will be bugs and some will likely kill people. But as a driver who has driven many years in less civilized countries, I know that human beings are terrible drivers.
Who would you rather share the road with, computer drivers that drive like your grandma, or a bunch of humans? It's a no-brainer right?
Yes. However, the current failure rate of human drivers being improved on is the standard I care about.
http://www.cnbc.com/2015/10/29/crash-data-for-self-driving-c...
> After crunching the data, Schoettle and Sivak concluded there's an average of 9.1 crashes involving self-driving vehicles per million miles traveled. That's more than double the rate of 4.1 crashes per one million miles involving conventional vehicles.
That is the only number that matters to me. Google gets that to 4.0 per million miles and I'd say they are good to go.
What is the crash rate per mile for a driver who is paying attention? If it is 1 per million miles, the self-driving car would need to be a lot lower. Now, if it were 4am and I were falling asleep at the wheel, I bet any self-driving car would beat me. So cool to turn on then, but maybe not for a daytime cruise...
I'm all for automation, but WTF? Insert even a semi-competent engineer in the loop to monitor the configuration change as it propagates around and the entire problem could have been addressed almost trivially, as the human engineers eventually decided to do.
Secondly, I'm seeing just shy of 500 individual prefixes, 282 directly connected peers (other networks), and a presence at over 100 physical internet exchanges, just for one of Google's four ASes.
Would you be able to read over that configuration and tell me if it has errors?
Google has at least tens of data center locations, each of which will have multiple physical failure domains.
There are also many discontiguous routes being announced at all of their network PoPs. They have substantially more PoPs than data centers.
It very quickly gets to be too much to reasonably expect people to keep track of what the system should look like, let alone grasp what it does look like.
Formal systems?
Having said that, I am still scared; I'm not sure how well Tesla's autopilot would handle a tire blowout at 70mph. Perhaps better than I would, but I would much rather be in control.
This is a really important point that should be more generally known. To quote Google's own "Paxos Made Live" paper, from 2007:
> In closing we point out a challenge that we faced in testing our system for which we have no systematic solution. By their very nature, fault-tolerant systems try to mask problems. Thus they can mask bugs or configuration problems while insidiously lowering their own fault-tolerance.
As developers we can try to bear this principle in mind, but as Monday's incident demonstrated, mistakes can still happen. So, has anyone managed to make progress toward a "systematic solution" in the last 9 years?
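One small, hedged illustration of the masking problem the paper describes: a retry wrapper that absorbs transient failures, but exports a counter so the fault tolerance it silently spends is at least visible to monitoring. All names here are invented for the sketch:

```python
class MaskedFailureCounter:
    """Tracks failures that a fault-tolerant wrapper silently absorbed."""
    def __init__(self):
        self.masked_failures = 0

def call_with_retry(fn, counter, attempts=3):
    """Retry fn; succeed if any attempt works, but count every masked failure.

    Exporting counter.masked_failures as a metric turns invisible,
    fault-tolerance-eroding errors into an observable signal.
    """
    last_exc = None
    for _ in range(attempts):
        try:
            return fn()
        except Exception as exc:
            counter.masked_failures += 1
            last_exc = exc
    raise last_exc  # all attempts failed: stop masking and surface the error
```

Without the counter, a dependency degrading from 0% to 60% failure looks identical from the outside, right up until it crosses the retry budget and everything fails at once.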
The problem is that these failure cases are exercised much less frequently than the "normal execution" code paths are. For example, every year Google does DiRT [1] exercises which test system responses to a large calamity, eg. a California earthquake that kills everyone in Mountain View and SF including the senior leadership, and also knocks out all west coast datacenters. The half-life of code at Google (in my observation) is roughly 1 year, which means that half of all code has never gone through a DiRT exercise. The same applies to other, less serious fault injection mechanisms: they may get executed once every year or two, and serious bugs can crop up in the meantime. Automated testing of fault injection isn't really feasible, because the number of potential faults grows combinatorially with the number of independent RPCs in the system.
I'd be willing to bet that the two bugs that caused this outage were less than 6 months old. In my tenure at Google, the vast majority of bugs that showed up in postmortems were introduced < 3 months before the outage.
[1] http://everythingsysadmin.com/2012/09/devops-google-reveals-...
Ex: Chubby planned outages. Google found that Chubby was consistently over its SLO, and that global Chubby outages would cause unusually bad outages at Google. Chubby was so reliable that teams were incorrectly assuming it would never be down, and failing to design systems that account for failures in Chubby. Solution: take Chubby down globally when it's too far above its SLO for a quarter, to "show" teams that Chubby can go down.
I remember my founder (ex Googler) telling us about fault injection at Google. We were pretty amazed by the idea. Thanks for the link @nostrademons.
That said, based on this post-mortem, I think Google, and our industry as a whole, is doing a pretty good job. Periodic failures like this are inevitable, and if they serve to make it less likely that a similar failure occurs in the future, then that is a system as a whole that could be described as "anti-fragile".
[1] At least my interpretation of them
That depends on how you define "solution". If development time isn't a concern, then formal verification is a pretty solid solution. AWS has used TLA+ on a subset of its systems. [0]
For example, the CAN bus normally has an automatic retry feature on a variety of errors. A properly functioning CAN bus should have a bit error rate that is nearly zero. Lightly loaded, it can tolerate a very high error rate (say, due to noise, poor termination, etc). In that situation, the product would report a specific warning message to higher-level SCADA systems, such that it gets bubbled up all the way to the operators.
One of the bugs in this postmortem was that the process in question didn't do this, instead masking the error. Somewhat understandable, as I found the whole "execute a fallback, report the failure, and let the monitoring rules deal with it" philosophy one of the most confusing parts of being a Noogler. If you've never worked on distributed systems before, the idea that there is a monitoring system is a strange concept.
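A minimal sketch of that "execute a fallback, report the failure" philosophy, with a hypothetical `fetch_new` callable standing in for whatever actually retrieves the config:

```python
import logging

logger = logging.getLogger("config_loader")

def load_config(fetch_new, last_good):
    """Return (config, degraded). On fetch failure, serve the last-good
    copy but log loudly -- the fallback is fine; the silence is the bug."""
    try:
        return fetch_new(), False
    except Exception as exc:
        # Report before falling back, so monitoring rules can alert on it.
        logger.error("config fetch failed, serving last-good: %s", exc)
        return last_good, True
```

The caller keeps running either way; the difference is that monitoring can now alert on a rising rate of degraded loads instead of discovering the problem only when the last-good copy goes stale.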
http://techblog.netflix.com/2011/07/netflix-simian-army.html
Allow me to introduce you to the fantastic and battle-tested http://learnyousomeerlang.com/what-is-otp , preferably utilized (IMHO) via http://elixir-lang.org/
As a general reflection, many distributed systems leave out the cause of their changes and only log actions. Instead of logging "new membership, new members are b,c,d" you are better off logging "node a has not responded to heartbeat in the last 30 seconds, considering it faulty". Following such a principle makes it much easier to spot masked bugs, since you can reason about the behaviour much better.
Aggregating logs to a central location and being able to analyze global behaviour in retrospect is also a great feature.
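As a sketch of the difference (the field names here are invented), a structured log entry that records the cause alongside the resulting action:

```python
import json

def membership_log_entry(members, cause):
    """Build a log entry carrying the *cause* of the change, not just
    the action, so a wrongly-declared-dead node is visible in the logs."""
    return json.dumps({
        "action": "new membership",
        "members": sorted(members),
        "cause": cause,  # e.g. "node a missed heartbeats for 30s"
    })
```

When these entries land in a central store, a query for membership changes whose cause was a heartbeat timeout during an otherwise healthy period is exactly the kind of masked-bug signal the comment describes.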
1. Evaluated a configuration change before the change had finished syncing across all configuration files, resulting in rejecting the change.
2. So it tried to reject the change, but actually just deleted everything instead.
3. Something was supposed to catch changes that break everything, and it detected that everything was broken but its attempt to do anything to fix it failed.
It is hard to imagine that this system has good test coverage.
That doesn't mean that bugs can't creep in. Who knows, maybe these were all extremely unlikely bugs and Google hit an astronomically unlikely bad-luck streak. Happens.
I mean, this problem was a result of MULTIPLE untested failure states.
And yes, it IS possible to unit-test this sort of thing. You can fake out network connections and responses. I haven't yet found something that's impossible to unit-test if you just think about how to do it properly.
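For instance, a revert path can be exercised with a faked transport. This sketch (all names hypothetical, not Google's actual system) uses `unittest.mock` to simulate a failing canary:

```python
from unittest import mock

def push_config(transport, new_config, last_good):
    """Push a config site-wide only if the canary passes; otherwise revert."""
    if not transport.canary_ok(new_config):
        return last_good            # the revert path worth unit-testing
    transport.activate(new_config)
    return new_config

def test_canary_failure_reverts():
    transport = mock.Mock()
    transport.canary_ok.return_value = False   # fake a failing canary
    assert push_config(transport, {"v": 2}, {"v": 1}) == {"v": 1}
    transport.activate.assert_not_called()     # bad config never activated
```

No real network is involved; the failure state is reached on demand, which is precisely the point.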
EDIT: Why downvotes without a typewritten rebuttal? That's just not what I expect from HN (as opposed to, say, Reddit)
2 and 3 shouldn't have happened. But since they aren't releasing any further details, it would be unfair to rate the system.
For progressive rollouts, what if config changes were pulled instead of pushed?
Each system would be responsible for updating itself, verifying (canary, smoketest, making sure other systems updated successfully, etc.), bouncing, and then rolling back as needed.
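One iteration of such a pull loop might look like this sketch, with hypothetical callables standing in for the fetch/verify/activate/rollback steps:

```python
def pull_and_apply(fetch, verify, activate, rollback, current):
    """One iteration of a pull-based update loop: the node fetches,
    verifies (canary/smoketest), activates, and rolls itself back on
    failure, rather than waiting for a central push to fix it."""
    new = fetch()
    if new == current:
        return current           # nothing to do
    if not verify(new):
        return current           # refuse the bad config; keep the old one
    try:
        activate(new)
        return new
    except Exception:
        rollback(current)        # self-heal locally
        return current
```

Because each node makes its own decision, a bad config that slips past central checks is still rejected or rolled back node-by-node instead of being force-applied everywhere at once.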
The problem here was that there was a bug in the health check that masked the problem by assigning the last-good configuration, and then there was a bug in that code that had saved "nothing" as the last-good configuration. So rather than failing and having the error caught at the top level, it failed and buggy failure-recovery code made the problem worse.
Classic Two Generals. "No news is good news," generally isn't a good design philosophy for systems designed to detect trouble. How do we know that stealthy ninjas haven't assassinated our sentries? Well, we haven't heard anything wrong...
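Guarding the "last good" store against exactly that failure mode is cheap. A hedged sketch (the store shape here is invented for illustration):

```python
def save_last_good(store, config):
    """Refuse to record an empty config as 'last good' -- the second bug
    was effectively treating 'nothing' as the known-good state."""
    if not config:
        raise ValueError("refusing to save an empty config as last-good")
    store["last_good"] = config

def restore_last_good(store):
    """Fail loudly rather than silently apply an empty configuration."""
    config = store.get("last_good")
    if not config:
        raise RuntimeError("no valid last-good config available")
    return config
```

Both checks convert a silent "revert to nothing" into an immediate, visible failure at the point where recovery code runs.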
anycast "canary test in progress"
edge routers store new configs
anycast "canary test PASS"
edge routers activate new config
edge routers canary test new config (and pass or revert)
edge routers report home that all is well
"You can fly safely, we have canaries and staged deployment"
A year forward:
"Unfortunately because the canary verification as well as the staged deployment code was broken, instead of one crash and 300 dead, an update was pushed to all aircraft, which subsequently caused them to crash, killing 70,000 people."
I'm not 100% sure why they don't do the staged deployment for google scale server networking over a few days (or even weeks in some cases) instead of a few hours, but I don't know the details here...
It's good that they had manually triggerable configuration rollback possibility and a pre-set policy so it was solved so quickly.
As a founder of a startup that hosts services on GCE I'm happy with the trade-off they've chosen.
At some point, delaying the deployment of updates system wide would cause more, not less risks.
On Hacker News the "move fast and break things" ethos is probably making sense for many of the people submitting and commenting, since their business is closer to casual usage anyway. But that's not the whole audience.
These safeguards include a canary step where the configuration is deployed at a single site and that site is verified to still be working correctly
This sounds very unprofessional imho. "Touch this cable to see if there is electricity running" sort of thing.
Is that really how it should be done?
Writing one is a good learning exercise, and it's more of a learning exercise than a punishment.
Source: I work on the team that writes these external postmortems.
Sample: https://status.cloud.google.com/incident/appengine/16002
Note that the length of the report tends to correlate with the severity of the outage, and that disruptions (code orange) do not get reports.
Disclaimer: I work in Cloud Support and write some of these.
>Crucially however, a second software bug in the management software did not propagate the canary step’s conclusion back to the push process, and thus the push system concluded that the new configuration was valid and began its progressive rollout.
I assume the software was originally tested to make sure it works in case of failure. It would be interesting to know exactly what the bug was and why it didn't show in tests.
However, this is a great detailed post-mortem from a service provider. Your Telco or ISP will never provide this much detail...
E.g., I'm sure we will never hear that Bank of X transferred a billion dollars to an account but, because of propagation errors, published only the credit and didn't finish the debit, and now we have two billionaires. This two-or-more-phase commit is pretty much bulletproof in banking as far as I know, and banks are not known to be technologically more advanced than Google. How come internet routing is so prone to errors that can make an entire cloud service unavailable for even a small period of time? I'm far from knowing much about networking (although I took some graduate networking courses, I still feel I know practically nothing about it...), so I would appreciate it if someone versed in this could ELI5 whether it can happen in AWS and Azure regardless of how redundant you are (which leads to a notion of cross-cloud-provider redundancy, which I'm sure is used in some places), whether the banking analogy is fair and relevant, and whether there are any RFCs to make world-blackout routing nightmares less likely to happen.
EDIT: Also, to answer the question: I think distributed computing is hard. The bank will usually have all their account balances on one huge central mainframe in one location, so you do not need to rely on computers talking to each other. And also, a bank does not really need to publish credits and debits at the same time - they just have to make sure your account is debited at or before the other account is credited (in fact, with most money transfers between banks there will be days between these two). So they can just debit your account, check whether this has worked and then send the money on its journey afterwards and be done with it. If a bug happens and the money does not show up at the recipient, they will complain, the bank can look into it and fix it - no (or not much to the bank, anyways) harm done.
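The debit-at-or-before-credit ordering, paired with a reconciliation log, can be sketched like this. It is purely illustrative and not how any real bank's core system works:

```python
def transfer(accounts, src, dst, amount, log):
    """Debit-before-credit transfer with a write-ahead log entry, so a
    crash between the two steps leaves a reconcilable 'pending' record
    rather than silently lost (or duplicated) money."""
    if accounts[src] < amount:
        raise ValueError("insufficient funds")
    log.append(("pending", src, dst, amount))
    accounts[src] -= amount          # debit first...
    accounts[dst] += amount          # ...credit after; never credit first
    log.append(("done", src, dst, amount))
```

A recovery job that finds a "pending" entry with no matching "done" knows exactly which transfer to investigate, which is the "bank can look into it and fix it" step the comment describes.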
My understanding, from the odd bits and bobs of information I have, is that AWS regions are typically managed somewhat independently.
It's certainly good that they detected it as fast as they did. But I wonder if the fix time could be improved upon? Was the majority of that time spent discussing the corrective action to be taken? Or does it take that much time to replicate the fix?
Rushing to enact a solution can sometimes exacerbate the problem.
If the rollout took 12 hours instead of 4 or the VPN failure to total failure was multiple hours instead of minutes, they'd have had enough time to noodle it out. Eventually at a slow enough deploy rate they'd have figured it out. It only took 18 hours to make the final report after all, so an even slower 24 hour deploy would have been slow enough, if enough resources were allocated.
On the opposite side, most of the time when you screw up routing the punishment is extremely brutal and fast. If the whole thing croaked in five minutes, "OK, who hit enter within the last ten minutes..." and five minutes later it's all undone. What happened instead was: dude hit enter, all was well for hours, although average latency was increasing very slowly as anycast sites shut down. Maybe there's even a shift change in the middle. Finally, hours later, it all hit the fan, meanwhile the guy who hit enter is thinking "it can't be me, I hit enter over four hours ago followed by three hours of normal operation... must be someone else's change, or a memory leak, or a novel cyberattack, or ..."
Theoretically, if you're going to deploy anycast you could deploy a monitoring tool that traceroutes to each site to see that it's up; however, you deploy anycast precisely so that it never drops... It's the Titanic effect: this is why it's unsinkable, so why would you bother checking to see if it's sinking? And just like the Titanic, if you break 'em all in the same accident, that sucker is eventually going down, even if it takes hours to sink.
Of course, the traffic load might have overwhelmed that single datacenter but that would be alleviated as soon as additional datacenters came back online ("announced the prefixes"). A portion of the traffic load would shift to each new datacenter as it came back online.
It could have been hours later before they were all operational again but, as far as the users were concerned, the service was up and running and back to normal as soon as the first one or two datacenters came back up.
e.g. if the detection mechanism latency is ~60s but the time-to-resolve is 18 mins, then I wonder: "how good could the best possible recovery system be?" Implicit in this question is that I think the answer to my question could just as easily be "19 minutes" as it could "5 minutes."
It's not a bias if I'm asking questions in order to improve the system. Could this fault have been predicted? Yes, IMO it could have. I believe that the fault in this case is grossly summarized as "rollback fails to rollback."
What if the major driver of the 18 minute latency was getting the right humans to agree that "execute recovery plan Q" was the right move? If that were the case then perhaps another item to learn could be "recovery policy item 23: when 'rollback fails to rollback', summon at least 3 of 5 Team Z participants and get consensus on recovery plan." And then maybe there could be a corresponding "change policy item 54: changes shall be barred until/unless 5 participants of Team Z are 'available'"
But that's all moot, if "fastest possible recovery [given XYZ constraints of BGP or whatever system] is ~16 minutes." Which it sounds like may indeed be the case.
These credits exceed what is promised by Google Cloud in their SLA's for Compute Engine and VPN service!
That outage gives GCE at best four 9's of reliability for 2016.
https://status.cloud.google.com/summary
It looks like GCE uptime is well below four 9's reliability for a sliding 1 year timeframe.
"On Tuesday 23 February 2016, for a duration of
10 hours and 6 minutes, 7.8% of Google Compute Engine
projects had reduced quotas. ... Any resources that
were already created were unaffected by this issue."
I'm not sure off the top of my head how I'd try to compute the overall availability numbers from that one. One could possibly try to determine and sum the effects on the individual customers, but we can't from the information provided. It's certainly less overall downtime than counting it as a 7-hour failure, though. The other incidents (as far as I can tell) were service disruptions at the AZ/regional level. Those disruptions don't impact the 9's, as GCE was available in other regions.
> Crucially however, a second software bug in the management software did not propagate the canary step’s conclusion back to the push process
I'm sure the devil is in the details, but generally speaking, these are 2 instances of critical code that gets exercised infrequently, which is a good place for bugs to hide.
The first graph, quoted from a survey paper, is a classic that fits the GCE outage well:
Initial error --92%--> Incorrect handling of errors explicitly signaled in software
(As background, the author, MIT Prof. Nancy Leveson, summarizes decades of work in the field, offers groundbreaking new theoretical tools that scale up to some of the world's most complex accidents, and has the experience and evidence to back up their relevance e.g. via work on Therac-25, the Columbia Space Shuttle, and Deepwater Horizon to name just a few...)
Always test your crash / exception handling / special case termination+recovery code in production.
I have seen this too often. Most often in "every day" cases where a service has a "nice" catch-based way of stopping and recovering, and then has a separate "killed by SIGKILL/immediate power failure" crash-and-recover path. This last bit never gets tested, yet runs in production.
One day a power failure happens, the service restarts and tries to recover. Code that almost never runs now runs, and the whole thing goes into an unknown broken state.
https://googleblog.blogspot.com/2014/01/todays-outage-for-se...
See: http://danluu.com/postmortem-lessons/
> Configuration
>
> Configuration bugs, not code bugs, are the most common cause I've seen of really bad outages. When I looked at publicly available postmortems, searching for "global outage postmortem" returned about 50% outages caused by configuration changes. Publicly available postmortems aren't a representative sample of all outages, but a random sampling of postmortem databases also reveals that config changes are responsible for a disproportionate fraction of extremely bad outages. As with error handling, I'm often told that it's obvious that config changes are scary, but it's not so obvious that most companies test and stage config changes like they do code changes.
PS. On HN you should use asterisks to italicize instead of > for quoting.
It's a shame it's not easier or more common for people to create clones of (most|all) of their infrastructure for testing purposes.
Something like half of outages are caused by configuration oopsies.
If you accept that configuration is code, then you also come to the following disturbing conclusion: the usual test environment for critical network-related code in most environments is the production environment.
In an AWS environment, imagine a setup where all that differs is the API keys used (the API keys of the production vs test environment). What gets tricky is dealing with external dependencies, user data, and simulating traffic.
For an example more relevant to today's issue: imagine a second simulated "internet" in a globally distributed lab environment. With BGP configs, fake external BGP sessions, etc, servers receiving production traffic, etc.
I get that it's a lot of work to setup and would require ongoing work to maintain - and that it's hard/impossible to have it correctly simulate the many nuances of real world traffic - and yet I also think in many cases it would be sufficient to prevent issues from making it into production.
Pulling your own worldwide routes because you have too much automation; it will make a good story once it's filtered down a bit! Icarus was barely up in the air, too early for a fall.
"...team...worked in shifts overnight..."
The team in charge of solving this particular problem is located in two sites in two different timezones. This is true of most critical SRE teams at Google, and it is precisely to be able to have 24h coverage in these time sensitive situations.
In the 2+ years I have spent in SRE I have never heard of a single instance of an SRE being asked or even encouraged to stay after hours (let alone overnight) for incident remediation. There is quite a lot of emphasis being put on work/life balance.
configuration files strike again - remember knight capital?
Also, does anyone have a link to statistics on global BGP software usage? I'm curious what the marketshare looks like.
Perhaps the progressive rollout should wait for an affirmative conclusion instead of assuming no news is good news? I'm not being snarky, there may be some reason they don't do this.
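A fail-closed version of that check is tiny. In this sketch (names invented), anything other than an explicit PASS, including a missing result like the one the second bug produced, halts the rollout:

```python
def progressive_rollout(sites, canary_result, deploy):
    """Continue only on an explicit 'PASS'. None, 'FAIL', or a garbled
    result all stop the push: no news is treated as bad news."""
    if canary_result != "PASS":
        return []                  # fail closed: deploy nowhere
    deployed = []
    for site in sites:
        deploy(site)
        deployed.append(site)
    return deployed
```

The trade-off is availability of the push pipeline itself: a flaky canary reporting channel now blocks rollouts, which may be why a system handling a very high rate of changes wouldn't be built this way.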
PS. To the downvoters, truth hurts.
All this could have been contained if they deployed changes to different regions at different times. That would also help with screwing your overseas users less by running maintenance at 10am their local time :-)
The system does do progressive rollouts, which are essentially what you are referring to (albeit perhaps at a different pace). The number of changes being rolled out means that it's not really feasible to hand roll out configurations to different regions, so the checks are automated. In this case, the automated checks failed as well.
You are just confirming my previous comment. Your rollouts are automated, so pushing a change automatically configures every region, instead of configuring just one and maybe waiting a prudent, human-scale interval before the next one, because, surprise!, shit happens.
I understand your colleagues probably make lots of changes, but if that introduces risks of global outages IMHO you should reconsider your strategy.
And I'm not sure why you downvoted my previous comment. It's a perfectly valid observation, based on the published information.