As I assumed, it was kind of a corner-case bug meets corner-case bug meets corner-case bug.
This is also why I am afraid of self-driving cars and other such life-critical software. There are going to be weird edge cases; what prevents you from reaching them?
Making software is hard....
The real question is whether society can handle the unfairness of death by random software error vs. death by negligent driving. It's easy to blame negligent driving on the driver; we're clearly not negligent, so it really doesn't affect us, right? But a software error might as well be an act of god; it's something that might actually happen to me!
[1]: https://en.wikipedia.org/wiki/List_of_motor_vehicle_deaths_i...
At least with human drivers, the failures are generally uncorrelated.
Most people have a greater fear of flying than of driving, although statistically you're far more at risk in a car. One cause of that fear of flying is loss of control: you have to accept placing your life in someone else's hands.
With self-driving cars, I suspect lack of control will also be a problem. Either we need to provide passengers with some vestige of control to keep them busy, or we just wait a generation until people get used to it.
I'm very curious to see if our understanding (as a society) of our own technology will improve over time or if people will continue to blame the internet for "not working" 20 years from now.
We see this ALL the time with ALL the big companies including the ones I have worked for in the past. I am very interested in possible solutions people are cooking up here.
When an accident is inevitable, software will decide whether private or public property should be prioritized, which action is more likely to protect driver/passenger A to the detriment of driver/passenger B, etc.
Most people wouldn't blame the outcome of a split-second decision made in the heat of the moment, but would take issue when the action is deliberate.
Interesting times we live in
But how is Google or any other manufacturer going to test their software updates? Are they going to test-drive their cars for tens of thousands of miles over and over again for every little update?
Outages suck, but are inevitable even for Google. With a response like this Google has gained even more trust from me.
Pair this with the outage tracking tools and you can find all the outages that have happened across Google and what caused them.
Then there is DiRT[0] testing to try to catch problems in a controlled manner. Having things break randomly throughout Google's infrastructure, and having to see whether your service's setup and oncall people handle it properly, is a really awesome exercise.
[0] http://queue.acm.org/detail.cfm?id=2371516
The opinions stated here are my own, not necessarily those of Google.
Edit: Changed from saying "all" to "most" postmortems being available to Googlers to see.
With humans, the amount of knowledge gained and the collective improvement of driving behavior from a single accident is low, and each accident mostly provides some data points to tracked statistics. With machines, great systematic improvements are made possible over time such that the remaining edge cases will become increasingly improbable.
I'll have to point out that this is necessary, but not sufficient, for enabling an ever-improving, extremely safe activity.
Aviation also has just the right amount of blame running through the system, which is hard to replicate in any other area.
It's estimated that self-driving cars could reduce vehicle crashes by approximately 90%! [4]
[1] http://www-nrd.nhtsa.dot.gov/pubs/812115.pdf [2] http://www-nrd.nhtsa.dot.gov/Pubs/812219.pdf [3] http://www.who.int/violence_injury_prevention/road_safety_st... [4] http://www.mckinsey.com/industries/automotive-and-assembly/o...
People suck at driving. Even a shitty self-driving car will save a ton of lives simply by obeying traffic laws.
Personally, I think automated cars are going to easily be better than humans in the working cases (both human and AI are conscious). Next, I expect to see fully operational backup systems.
E.g., if a monitoring system decides that the primary system is failing for whatever reason, be it a bug or an unhandled road condition (a tree, etc.), the backup system takes over and solely attempts to get the driver off the road and into a safe location.
Humans often fail, but can often attempt to recover. And, as bad as we may be even at recovering, we know to try to avoid oncoming traffic. Computers (currently) are very bad at recovering when they fail. I feel like having a computer driving, in the event of failure, is akin to a narcoleptic driver: when it goes wrong, it goes really wrong. Hence I hope to see a backup system, completely isolated, and fully intent on safely changing lanes or finding a suitable area to pull over.
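A backup takeover like that can be sketched as a simple watchdog decision. This is purely illustrative; the function name, the heartbeat representation, and the timeout value are all invented for the sketch:

```python
def choose_action(now, last_heartbeat, heartbeat_timeout=0.5):
    """Decide which controller should act this tick.

    While the primary controller's heartbeats are fresh it stays in
    control; the moment it goes silent past the timeout, the isolated
    backup takes over with the single goal of pulling over safely.
    """
    if now - last_heartbeat > heartbeat_timeout:
        return "backup"   # primary presumed failed: execute the pull-over plan
    return "primary"
```

The key design choice is that the backup path is decided by elapsed time alone, with no dependence on the (possibly wedged) primary reporting its own health.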
If you dispute this, please explain.
If your position is that one death is too many, that is illogical relative to the option of letting people drive cars.
I'm not saying it won't get better, but pretending self-driving cars is a cure-all right now is hilarious and insane.
I wouldn't really agree with that. There were two pieces of code designed to perform checks on new configs and cancel them. They both failed. Neither of those checks is a corner case. If you had a spec sheet for the system that manages IP blocks, that functionality would be listed as a feature right up front.
Sounds to me like someone just didn't bother to test the failsafe part of the code.
How many people drive aggressively, speeding, or erratically? How many people do dumb things on the road?
As a software engineer I know that there will be bugs and some will likely kill people. But as a driver who has driven many years in less civilized countries, I know that human beings are terrible drivers.
Who would you rather share the road with, computer drivers that drive like your grandma, or a bunch of humans? It's a no-brainer right?
Yes. However, the current failure rate of human drivers being improved on is the standard I care about.
http://www.cnbc.com/2015/10/29/crash-data-for-self-driving-c...
> After crunching the data, Schoettle and Sivak concluded there's an average of 9.1 crashes involving self-driving vehicles per million miles traveled. That's more than double the rate of 4.1 crashes per one million miles involving conventional vehicles.
That is the only number that matters to me. Google gets that to 4.0 per million miles and I'd say they are good to go.
What is the crash rate per mile for a driver who is paying attention? If it is 1 per million miles, the self-driving car would need to be a lot lower. Now, if it were 4am and I were falling asleep at the wheel, I bet any self-driving car would beat me. So cool to turn on then, but maybe not for a daytime cruise...
I'm all for automation, but WTF? Insert even a semi-competent engineer in the loop to monitor the configuration change as it propagates around and the entire problem could have been addressed almost trivially, as the human engineers eventually decided to do.
Secondly, I'm seeing just shy of 500 individual prefixes, 282 directly connected peers (other networks), and a presence at over 100 physical internet exchanges, just for one of Google's four ASes.
Would you be able to read over that configuration and tell me if it has errors?
Google has at least tens of data center locations, each of which will have multiple physical failure domains.
There are also many discontiguous routes being announced at all of their network PoPs. They have substantially more PoPs than data centers.
It very quickly gets to be too much to reasonably expect people to keep track of what the system should look like, let alone grasp what it does look like.
Formal systems?
Having said that, I am still scared; I'm not sure how well Tesla's autopilot would handle a tire blowout at 70mph. Perhaps better than I would, but I would much rather be in control.
This is a really important point that should be more generally known. To quote Google's own "Paxos Made Live" paper, from 2007:
> In closing we point out a challenge that we faced in testing our system for which we have no systematic solution. By their very nature, fault-tolerant systems try to mask problems. Thus they can mask bugs or configuration problems while insidiously lowering their own fault-tolerance.
As developers we can try to bear this principle in mind, but as Monday's incident demonstrated, mistakes can still happen. So, has anyone managed to make progress toward a "systematic solution" in the last 9 years?
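One small, hedged illustration of the masking problem the paper describes: a retry wrapper that absorbs transient failures, but exports a counter so the fault tolerance it silently spends is at least visible to monitoring. All names here are invented for the sketch:

```python
class MaskedFailureCounter:
    """Tracks failures that a fault-tolerant wrapper silently absorbed."""
    def __init__(self):
        self.masked_failures = 0

def call_with_retry(fn, counter, attempts=3):
    """Retry fn; succeed if any attempt works, but count every masked failure.

    Exporting counter.masked_failures as a metric turns invisible,
    fault-tolerance-eroding errors into an observable signal.
    """
    last_exc = None
    for _ in range(attempts):
        try:
            return fn()
        except Exception as exc:
            counter.masked_failures += 1
            last_exc = exc
    raise last_exc  # all attempts failed: stop masking and surface the error
```

Without the counter, a dependency degrading from 0% to 60% failure looks identical from the outside, right up until it crosses the retry budget and everything fails at once.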
The problem is that these failure cases are exercised much less frequently than the "normal execution" code paths are. For example, every year Google does DiRT [1] exercises which test system responses to a large calamity, eg. a California earthquake that kills everyone in Mountain View and SF including the senior leadership, and also knocks out all west coast datacenters. The half-life of code at Google (in my observation) is roughly 1 year, which means that half of all code has never gone through a DiRT exercise. The same applies to other, less serious fault injection mechanisms: they may get executed once every year or two, and serious bugs can crop up in the meantime. Automated testing of fault injection isn't really feasible, because the number of potential faults grows combinatorially with the number of independent RPCs in the system.
I'd be willing to bet that the two bugs that caused this outage were less than 6 months old. In my tenure at Google, the vast majority of bugs that showed up in postmortems were introduced < 3 months before the outage.
[1] http://everythingsysadmin.com/2012/09/devops-google-reveals-...
Ex: Chubby planned outages. Google found that Chubby was consistently over its SLO, and that global Chubby outages would cause unusually bad outages at Google. Chubby was so reliable that teams were incorrectly assuming it would never be down, and failing to design systems that account for failures in Chubby. Solution: take Chubby down globally when it's too far above its SLO for a quarter, to "show" teams that Chubby can go down.
I remember my founder (ex Googler) telling us about fault injection at Google. We were pretty amazed by the idea. Thanks for the link @nostrademons.
That said, based on this post-mortem, I think Google, and our industry as a whole, is doing a pretty good job. Periodic failures like this are inevitable, and if they serve to make it less likely that a similar failure occurs in the future, then that is a system as a whole that could be described as "anti-fragile".
[1] At least my interpretation of them
That depends on how you define "solution". If development time isn't a concern, then formal verification is a pretty solid solution. AWS has used TLA+ on a subset of its systems. [0]
For example, the CAN bus normally has an automatic retry feature on a variety of errors. A properly functioning CAN bus should have a bit error rate that is nearly zero. Lightly loaded, it can tolerate a very high error rate (say, due to noise, poor termination, etc). In that situation, the product would report a specific warning message to higher-level SCADA systems, such that it gets bubbled up all the way to the operators.
One of the bugs in this postmortem was that the process in question didn't do this, instead masking the error. Somewhat understandable, as I found the whole "execute a fallback, report the failure, and let the monitoring rules deal with it" philosophy one of the most confusing parts of being a Noogler. If you've never worked on distributed systems before, the idea that there is a monitoring system is a strange concept.
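A minimal sketch of that "execute a fallback, report the failure" philosophy, with a hypothetical `fetch_new` callable standing in for whatever actually retrieves the config:

```python
import logging

logger = logging.getLogger("config_loader")

def load_config(fetch_new, last_good):
    """Return (config, degraded). On fetch failure, serve the last-good
    copy but log loudly -- the fallback is fine; the silence is the bug."""
    try:
        return fetch_new(), False
    except Exception as exc:
        # Report before falling back, so monitoring rules can alert on it.
        logger.error("config fetch failed, serving last-good: %s", exc)
        return last_good, True
```

The caller keeps running either way; the difference is that monitoring can now alert on a rising rate of degraded loads instead of discovering the problem only when the last-good copy goes stale.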
http://techblog.netflix.com/2011/07/netflix-simian-army.html
Allow me to introduce you to the fantastic and battle-tested http://learnyousomeerlang.com/what-is-otp , preferably utilized (IMHO) via http://elixir-lang.org/
As a general reflection, many distributed systems leave out the cause of their changes and only log actions. Instead of logging "new membership, new members are b,c,d" you are better off logging "node a has not responded to heartbeat in the last 30 seconds, considering it faulty". Following such a principle makes it much easier to spot masked bugs, since you can reason about the behaviour much better.
Aggregating logs to a central location and being able to analyze global behaviour in retrospect is also a great feature.
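As a sketch of the difference (the field names here are invented), a structured log entry that records the cause alongside the resulting action:

```python
import json

def membership_log_entry(members, cause):
    """Build a log entry carrying the *cause* of the change, not just
    the action, so a wrongly-declared-dead node is visible in the logs."""
    return json.dumps({
        "action": "new membership",
        "members": sorted(members),
        "cause": cause,  # e.g. "node a missed heartbeats for 30s"
    })
```

When these entries land in a central store, a query for membership changes whose cause was a heartbeat timeout during an otherwise healthy period is exactly the kind of masked-bug signal the comment describes.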
1. Evaluated a configuration change before the change had finished syncing across all configuration files, resulting in rejecting the change.
2. So it tried to reject the change, but actually just deleted everything instead.
3. Something was supposed to catch changes that break everything, and it detected that everything was broken but its attempt to do anything to fix it failed.
It is hard to imagine that this system has good test coverage.
That doesn't mean that bugs can't creep in. Who knows, maybe these were all extremely unlikely bugs and Google hit an astronomically unlikely bad-luck streak. Happens.
I mean, this problem was a result of MULTIPLE untested failure states.
And yes, it IS possible to unit-test this sort of thing. You can fake out network connections and responses. I haven't yet found something that's impossible to unit-test if you just think about how to do it properly.
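For instance, a revert path can be exercised with a faked transport. This sketch (all names hypothetical, not Google's actual system) uses `unittest.mock` to simulate a failing canary:

```python
from unittest import mock

def push_config(transport, new_config, last_good):
    """Push a config site-wide only if the canary passes; otherwise revert."""
    if not transport.canary_ok(new_config):
        return last_good            # the revert path worth unit-testing
    transport.activate(new_config)
    return new_config

def test_canary_failure_reverts():
    transport = mock.Mock()
    transport.canary_ok.return_value = False   # fake a failing canary
    assert push_config(transport, {"v": 2}, {"v": 1}) == {"v": 1}
    transport.activate.assert_not_called()     # bad config never activated
```

No real network is involved; the failure state is reached on demand, which is precisely the point.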
EDIT: Why downvotes without a typewritten rebuttal? That's just not what I expect from HN (as opposed to, say, Reddit)
2 and 3 shouldn't have happened. But since they aren't releasing any further details, it would be unfair to rate the system.
For progressive rollouts, what if config changes were pulled instead of pushed?
Each system would be responsible for updating itself, verifying (canary, smoketest, making sure other systems updated successfully, etc.), bouncing, and then rolling back as needed.
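One iteration of such a pull loop might look like this sketch, with hypothetical callables standing in for the fetch/verify/activate/rollback steps:

```python
def pull_and_apply(fetch, verify, activate, rollback, current):
    """One iteration of a pull-based update loop: the node fetches,
    verifies (canary/smoketest), activates, and rolls itself back on
    failure, rather than waiting for a central push to fix it."""
    new = fetch()
    if new == current:
        return current           # nothing to do
    if not verify(new):
        return current           # refuse the bad config; keep the old one
    try:
        activate(new)
        return new
    except Exception:
        rollback(current)        # self-heal locally
        return current
```

Because each node makes its own decision, a bad config that slips past central checks is still rejected or rolled back node-by-node instead of being force-applied everywhere at once.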
The problem here was that there was a bug in the health check that masked the problem by assigning the last-good configuration, and then there was a bug in that code that had saved "nothing" as the last-good configuration. So rather than failing and having the error caught at the top level, it failed and buggy failure-recovery code made the problem worse.
Classic Two Generals. "No news is good news," generally isn't a good design philosophy for systems designed to detect trouble. How do we know that stealthy ninjas haven't assassinated our sentries? Well, we haven't heard anything wrong...
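Guarding the "last good" store against exactly that failure mode is cheap. A hedged sketch (the store shape here is invented for illustration):

```python
def save_last_good(store, config):
    """Refuse to record an empty config as 'last good' -- the second bug
    was effectively treating 'nothing' as the known-good state."""
    if not config:
        raise ValueError("refusing to save an empty config as last-good")
    store["last_good"] = config

def restore_last_good(store):
    """Fail loudly rather than silently apply an empty configuration."""
    config = store.get("last_good")
    if not config:
        raise RuntimeError("no valid last-good config available")
    return config
```

Both checks convert a silent "revert to nothing" into an immediate, visible failure at the point where recovery code runs.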
anycast "canary test in progress"
edge routers store new configs
anycast "canary test PASS"
edge routers activate new config
edge routers canary test new config (and pass or revert)
edge routers report home that all is well
"You can fly safely, we have canaries and staged deployment"
A year forward:
"Unfortunately because the canary verification as well as the staged deployment code was broken, instead of one crash and 300 dead, an update was pushed to all aircraft, which subsequently caused them to crash, killing 70,000 people."
I'm not 100% sure why they don't do the staged deployment for google scale server networking over a few days (or even weeks in some cases) instead of a few hours, but I don't know the details here...
It's good that they had manually triggerable configuration rollback possibility and a pre-set policy so it was solved so quickly.
As a founder of a startup that hosts services on GCE I'm happy with the trade-off they've chosen.
At some point, delaying the deployment of updates system wide would cause more, not less risks.
On Hacker News the "move fast and break things" ethos is probably making sense for many of the people submitting and commenting, since their business is closer to casual usage anyway. But that's not the whole audience.
These safeguards include a canary step where the configuration is deployed at a single site and that site is verified to still be working correctly
This sounds very unprofessional imho. "Touch this cable to see if there is electricity running" sort of thing.
Is that really how it should be done?
Writing one is a good learning exercise, and it's more of a learning exercise than a punishment.
Source: I work on the team that writes these external postmortems.
Sample: https://status.cloud.google.com/incident/appengine/16002
Note that the length of the report tends to correlate with the severity of the outage, and that disruptions (code orange) do not get reports.
Disclaimer: I work in Cloud Support and write some of these.
>Crucially however, a second software bug in the management software did not propagate the canary step’s conclusion back to the push process, and thus the push system concluded that the new configuration was valid and began its progressive rollout.
I assume the software was originally tested to make sure it works in case of failure. It would be interesting to know exactly what the bug was and why it didn't show in tests.
However, this is a great detailed post-mortem from a service provider. Your Telco or ISP will never provide this much detail...
E.g., I'm sure we will never hear that Bank of X transferred a billion dollars to an account but, because of propagation errors, published only the credit and didn't finish the debit, and now we have two billionaires. This two-or-more-phase commit is pretty much bulletproof in banking as far as I know, and banks are not known to be technologically more advanced than Google. How come internet routing is so prone to errors that can make an entire cloud service unavailable for even a small period of time? I'm far from knowing much about networking (although I took some graduate networking courses, I still feel I know practically nothing about it...), so I would appreciate it if someone versed in this could ELI5 whether it can happen in AWS and Azure regardless of how redundant you are (which leads to a notion of cross-cloud-provider redundancy, which I'm sure is used in some places), whether the banking analogy is fair and relevant, and whether there are any RFCs to make world-blackout routing nightmares less likely to happen.
EDIT: Also, to answer the question: I think distributed computing is hard. The bank will usually have all their account balances on one huge central mainframe in one location, so you do not need to rely on computers talking to each other. And also, a bank does not really need to publish credits and debits at the same time - they just have to make sure your account is debited at or before the other account is credited (in fact, with most money transfers between banks there will be days between these two). So they can just debit your account, check whether this has worked and then send the money on its journey afterwards and be done with it. If a bug happens and the money does not show up at the recipient, they will complain, the bank can look into it and fix it - no (or not much to the bank, anyways) harm done.
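The debit-at-or-before-credit ordering, paired with a reconciliation log, can be sketched like this. It is purely illustrative and not how any real bank's core system works:

```python
def transfer(accounts, src, dst, amount, log):
    """Debit-before-credit transfer with a write-ahead log entry, so a
    crash between the two steps leaves a reconcilable 'pending' record
    rather than silently lost (or duplicated) money."""
    if accounts[src] < amount:
        raise ValueError("insufficient funds")
    log.append(("pending", src, dst, amount))
    accounts[src] -= amount          # debit first...
    accounts[dst] += amount          # ...credit after; never credit first
    log.append(("done", src, dst, amount))
```

A recovery job that finds a "pending" entry with no matching "done" knows exactly which transfer to investigate, which is the "bank can look into it and fix it" step the comment describes.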
My understanding, from the odd bits and bobs of information I have, is that AWS regions are typically managed somewhat independently.
It's certainly good that they detected it as fast as they did. But I wonder if the fix time could be improved upon? Was the majority of that time spent discussing the corrective action to be taken? Or does it take that much time to replicate the fix?
Rushing to enact a solution can sometimes exacerbate the problem.
If the rollout took 12 hours instead of 4 or the VPN failure to total failure was multiple hours instead of minutes, they'd have had enough time to noodle it out. Eventually at a slow enough deploy rate they'd have figured it out. It only took 18 hours to make the final report after all, so an even slower 24 hour deploy would have been slow enough, if enough resources were allocated.
On the opposite side, most of the time when you screw up routing the punishment is extremely brutal and fast. If the whole thing croaked in five minutes, "OK, who hit enter within the last ten minutes..." and five minutes later it's all undone. What happened instead was: dude hit enter, all was well for hours, although average latency was increasing very slowly as anycast sites shut down. Maybe there's even a shift change in the middle. Finally, hours later, it all hit the fan, meanwhile the guy who hit enter is thinking "it can't be me, I hit enter over four hours ago followed by three hours of normal operation... must be someone else's change, or a memory leak, or a novel cyberattack, or ..."
Theoretically, if you're going to deploy anycast you could deploy a monitoring tool that traceroutes to each site to see that it's up; however, you deploy anycast precisely so that it never drops... It's the Titanic effect: this is why it's unsinkable, so why would you bother checking to see if it's sinking? And just like the Titanic, if you break 'em all in the same accident, that sucker is eventually going down, even if it takes hours to sink.
Of course, the traffic load might have overwhelmed that single datacenter but that would be alleviated as soon as additional datacenters came back online ("announced the prefixes"). A portion of the traffic load would shift to each new datacenter as it came back online.
It could have been hours later before they were all operational again but, as far as the users were concerned, the service was up and running and back to normal as soon as the first one or two datacenters came back up.
e.g. if the detection mechanism latency is ~60s but the time-to-resolve is 18 mins, then I wonder: "how good could the best possible recovery system be?" Implicit in this question is that I think the answer to my question could just as easily be "19 minutes" as it could "5 minutes."
It's not a bias if I'm asking questions in order to improve the system. Could this fault have been predicted? Yes, IMO it could have. I believe that the fault in this case is grossly summarized as "rollback fails to rollback."
What if the major driver of the 18 minute latency was getting the right humans to agree that "execute recovery plan Q" was the right move? If that were the case then perhaps another item to learn could be "recovery policy item 23: when 'rollback fails to rollback', summon at least 3 of 5 Team Z participants and get consensus on recovery plan." And then maybe there could be a corresponding "change policy item 54: changes shall be barred until/unless 5 participants of Team Z are 'available'"
But that's all moot, if "fastest possible recovery [given XYZ constraints of BGP or whatever system] is ~16 minutes." Which it sounds like may indeed be the case.
These credits exceed what is promised by Google Cloud in their SLA's for Compute Engine and VPN service!
That outage gives GCE at best four 9's of reliability for 2016.
https://status.cloud.google.com/summary
It looks like GCE uptime is well below four 9's reliability for a sliding 1 year timeframe.
"On Tuesday 23 February 2016, for a duration of
10 hours and 6 minutes, 7.8% of Google Compute Engine
projects had reduced quotas. ... Any resources that
were already created were unaffected by this issue."
I'm not sure off the top of my head how I'd try to compute the overall availability numbers from that one. One could possibly try to determine and sum the effects on the individual customers, but we can't from the information provided. It's certainly less overall downtime than counting it as a 7-hour failure, though. The other incidents (as far as I can tell) were service disruptions at the AZ/regional level. Those disruptions don't impact the 9's, as GCE was available in other regions.
> Crucially however, a second software bug in the management software did not propagate the canary step’s conclusion back to the push process
I'm sure the devil is in the details, but generally speaking, these are 2 instances of critical code that gets exercised infrequently, which is a good place for bugs to hide.
The first graph, quoted from a survey paper, is a classic that fits the GCE outage well:
Initial error --92%--> Incorrect handling of errors explicitly signaled in software
(As background, the author, MIT Prof. Nancy Leveson, summarizes decades of work in the field, offers groundbreaking new theoretical tools that scale up to some of the world's most complex accidents, and has the experience and evidence to back up their relevance e.g. via work on Therac-25, the Columbia Space Shuttle, and Deepwater Horizon to name just a few...)
Always test your crash / exception handling / special case termination+recovery code in production.
I have seen this too often. Most often in "every day" cases where a service has a "nice" catch-based way of stopping and recovering, and then has a separate "killed by SIGKILL/immediate power failure" crash-and-recover path. This last bit never gets tested, yet runs in production.
One day a power failure happens, the service restarts and tries to recover. Code that almost never runs now runs, and the whole thing goes into an unknown broken state.
https://googleblog.blogspot.com/2014/01/todays-outage-for-se...
See: http://danluu.com/postmortem-lessons/
> Configuration
>
> Configuration bugs, not code bugs, are the most common cause I've seen of really bad outages. When I looked at publicly available postmortems, searching for "global outage postmortem" returned about 50% outages caused by configuration changes. Publicly available postmortems aren't a representative sample of all outages, but a random sampling of postmortem databases also reveals that config changes are responsible for a disproportionate fraction of extremely bad outages. As with error handling, I'm often told that it's obvious that config changes are scary, but it's not so obvious that most companies test and stage config changes like they do code changes.
PS. On HN you should use asterisks to italicize instead of > for quoting.
It's a shame it's not easier or more common for people to create clones of (most|all) of their infrastructure for testing purposes.
Something like half of outages are caused by configuration oopsies.
If you accept that configuration is code, then you also come to the following disturbing conclusion: the usual test environment for critical network-related code in most environments is the production environment.
In an AWS environment, imagine a setup where all that differs is the API keys used (the API keys of the production vs test environment). What gets tricky is dealing with external dependencies, user data, and simulating traffic.
For an example more relevant to today's issue: imagine a second simulated "internet" in a globally distributed lab environment. With BGP configs, fake external BGP sessions, etc, servers receiving production traffic, etc.
I get that it's a lot of work to setup and would require ongoing work to maintain - and that it's hard/impossible to have it correctly simulate the many nuances of real world traffic - and yet I also think in many cases it would be sufficient to prevent issues from making it into production.
Pulling your own worldwide routes because you have too much automation; it will make a good story once it's filtered down a bit! Icarus was barely up in the air, too early for a fall.
"...team...worked in shifts overnight..."
The team in charge of solving this particular problem is located in two sites in two different timezones. This is true of most critical SRE teams at Google, and it is precisely to be able to have 24h coverage in these time sensitive situations.
In the 2+ years I have spent in SRE I have never heard of a single instance of an SRE being asked or even encouraged to stay after hours (let alone overnight) for incident remediation. There is quite a lot of emphasis being put on work/life balance.
configuration files strike again - remember knight capital?
Also, does anyone have a link to statistics on global BGP software usage? I'm curious what the marketshare looks like.
Perhaps the progressive rollout should wait for an affirmative conclusion instead of assuming no news is good news? I'm not being snarky, there may be some reason they don't do this.
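A fail-closed version of that check is tiny. In this sketch (names invented), anything other than an explicit PASS, including a missing result like the one the second bug produced, halts the rollout:

```python
def progressive_rollout(sites, canary_result, deploy):
    """Continue only on an explicit 'PASS'. None, 'FAIL', or a garbled
    result all stop the push: no news is treated as bad news."""
    if canary_result != "PASS":
        return []                  # fail closed: deploy nowhere
    deployed = []
    for site in sites:
        deploy(site)
        deployed.append(site)
    return deployed
```

The trade-off is availability of the push pipeline itself: a flaky canary reporting channel now blocks rollouts, which may be why a system handling a very high rate of changes wouldn't be built this way.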
PS. To the downvoters, truth hurts.
All this could have been contained if they deployed changes to different regions at different times. That would also help with screwing your overseas users less by running maintenance at 10am their local time :-)
The system does do progressive rollouts, which are essentially what you are referring to (albeit perhaps at a different pace). The number of changes being rolled out means that it's not really feasible to hand roll out configurations to different regions, so the checks are automated. In this case, the automated checks failed as well.
You are just confirming my previous comment. Your rollouts are automated, so pushing a change automatically configures every region, instead of configuring just one and maybe waiting a prudent, human-scale interval before the next one, because, surprise!, shit happens.
I understand your colleagues probably make lots of changes, but if that introduces risks of global outages IMHO you should reconsider your strategy.
And I'm not sure why you downvoted my previous comment. It's a perfectly valid observation, based on the published information.