The second one is covered in the article: their system for that purpose crashed, then the system that babysits it crashed, and then whatever they use to monitor the monitor's monitor didn't notice. It probably showed up in some dude's nightly syslog dump the next day. Oh well. If your monitoring tool breaks due to complexity (as they often do), it needs to simplicate and add lightness, not slather on more complexity. Monitoring is usually more complicated and less reliable than operating; it's harder, computationally and procedurally, to decide right from wrong than to just do it.
The odds of a cascaded failure are very low. Given fancy enough backup systems, that means nearly all remaining problems will be weird cascaded failure modes. That might be useful in training.
When I was doing this kind of stuff I was doing higher-level support, so see above: at least some of my stories are weird, cascaded, "impossible," etc. A slower rollout would have saved them. Working all by myself, I like to think I could have figured it out by comparing BGP looking-glass results and traceroute outputs from multiple very slowly arriving latency reports against router configs, with papers all over my desk and multiple monitors, in at most maybe two days. "Huh, it's almost like anycast stops working at more sites every couple of hours. Huh."

Of course their automated deployment completes in only four hours, which means any problem that takes "your average dude" more than four hours of BAU time to fix is going to totally explode the system and result in headlines instead of a weird bump on a graph somewhere. Given that computers are infinitely patient, slowing the rollout of automated deployments from four hours to four days would have saved them for sure. Don't forget that normal troubleshooting shops will blow the first couple of hours on written procedures and scripts, because honestly, most of the time those DO work. So my ability to figure it out by myself in 24 hours is useless if the time from escalation to hitting the fan was only an hour, because they roll out so fast. Once it hit the fan, a total company effort fixed it a lot faster than I could have as an individual. A sketch of what I mean by a slower rollout follows.
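To make the "four days instead of four hours" idea concrete, here's a minimal sketch of a staged rollout loop. Everything here is a hypothetical placeholder (deploy_to, healthy, rollback, the site list, the wave sizes), not anything Google actually runs; the point is only the pacing and the halt-on-anomaly check between waves, which leaves room for slow-arriving human reports to stop the machine.

    import time

    # Hypothetical inventory and wave sizes: 1 canary site first, then wider.
    SITES = [f"site-{n}" for n in range(100)]
    WAVES = [1, 4, 15, 30, 50]
    BAKE_SECONDS = 8 * 60 * 60   # ~8 hours of soak per wave -> days overall

    def deploy_to(site):         # placeholder: push the config/binary
        print(f"deploying to {site}")

    def healthy(sites):          # placeholder: ask monitoring about anomalies
        return True

    def rollback(sites):         # placeholder: revert the deployed sites
        print(f"rolling back {sites}")

    def staged_rollout():
        done = []
        remaining = list(SITES)
        for wave_size in WAVES:
            wave, remaining = remaining[:wave_size], remaining[wave_size:]
            for site in wave:
                deploy_to(site)
            done.extend(wave)
            time.sleep(BAKE_SECONDS)   # bake: let slow latency reports arrive
            if not healthy(done):      # any anomaly halts the rollout early
                rollback(done)
                return False
        return True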
Or there's the strategy I proposed, where computers are also infinitely fast: roll out in five minutes, one minute to say WTF, five minutes to roll back. An 11-minute outage is better than what they actually got. It's not like Google is hurting for computational power. Or money.
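The fast-path version is the same loop compressed, reusing the hypothetical placeholders from the sketch above: blast the change everywhere, watch for a minute, and revert automatically. The arithmetic is the whole point (5 + 1 + 5 = 11 minutes worst case).

    def fast_rollout():
        for site in SITES:        # ~5 minutes: push the change everywhere
            deploy_to(site)
        time.sleep(60)            # ~1 minute: wait for monitoring to say WTF
        if not healthy(SITES):
            rollback(SITES)       # ~5 minutes: automatic revert
            return False          # worst case: an 11-minute outage
        return True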
I'm sure there are valid justifications for the awkward four-hour rollout that's both too fast and too slow. I have no idea what they are, but the Google guys probably put some time into thinking about it.