In a failure case, it should remove the failing config, not all of them.
Pretty hard thing to miss if you test for it with any level of basic unit test or similar.
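To make that concrete, here is a minimal sketch of the kind of unit test I mean. Everything in it (`ConfigStore`, `apply_all`, the `is_valid` rule) is made up for illustration. It is not the actual code involved, just the shape of the check, assuming configs are validated one by one.

```python
class ConfigStore:
    """Hypothetical store of named configs (illustration only)."""

    def __init__(self, configs):
        self.configs = dict(configs)

    def apply_all(self, validate):
        """Validate each config and drop only the ones that fail.

        Returns the sorted names of the configs that were removed.
        """
        failed = [name for name, cfg in self.configs.items() if not validate(cfg)]
        for name in failed:
            # Remove just the failing entry, never the whole set.
            del self.configs[name]
        return sorted(failed)


def is_valid(cfg):
    # Assumed rule for this example: a config needs a non-empty "blocks" list.
    return bool(cfg.get("blocks"))


def test_only_failing_config_is_removed():
    store = ConfigStore({
        "a": {"blocks": ["10.0.0.0/8"]},
        "b": {"blocks": []},  # the bad one
        "c": {"blocks": ["192.168.0.0/16"]},
    })
    removed = store.apply_all(is_valid)
    assert removed == ["b"]
    assert sorted(store.configs) == ["a", "c"]


test_only_failing_config_is_removed()
```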
Second bug: canary failure should prevent further propagation of the bad config.
This one is a little harder to cover with automated tests because it requires a connection. It sounds like this was in fact tested, but the interaction between the two bits of software was not. A good integration test would have caught this, though I wouldn't call that required. I would, however, expect that particular code path to be at least manually checked because, you know, it's a feature for disaster prevention / recovery.
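As a rough sketch of what testing the hand-off between the two pieces could look like, with the network replaced by fakes: `rollout`, `canary_ok` and `push_to_fleet` are hypothetical names I'm using for illustration, not anything from the write-up.

```python
def rollout(config, canary_ok, push_to_fleet):
    """Push config to the fleet only if the canary accepts it.

    canary_ok and push_to_fleet are hypothetical callables standing in
    for the two separate pieces of software.
    """
    if not canary_ok(config):
        # Canary rejected the config: stop here, nothing propagates.
        return False
    push_to_fleet(config)
    return True


def test_canary_failure_blocks_propagation():
    pushed = []
    ok = rollout({"id": "bad"}, canary_ok=lambda cfg: False, push_to_fleet=pushed.append)
    assert ok is False
    assert pushed == []


def test_canary_success_allows_propagation():
    pushed = []
    ok = rollout({"id": "good"}, canary_ok=lambda cfg: True, push_to_fleet=pushed.append)
    assert ok is True
    assert pushed == [{"id": "good"}]


test_canary_failure_blocks_propagation()
test_canary_success_allows_propagation()
```

The point of the fakes is that the test exercises the wiring between the canary result and the push decision, which is exactly the part that apparently went unchecked.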
There was enough information to deduce this pretty easily, although they did tend to gloss over it in the write-up, almost purposefully.
For all those spouting that this was a good postmortem: not really. It's a good covering of one's ass, a good spin that sidesteps the real root cause.
What have SLAs and "here take credits" got to do with a postmortem?
I'm not really sure why I got downvoted for this. The postmortem was good, but it wasn't something I'd strive for. I like gcloud and I'll keep using it, but I find the response to this thing a little bit hard to swallow.
Because you have an apparently incredibly simple mental model for the system and so of course tests for it seem simple?
I don't doubt that Google's infrastructure is as complicated and nuanced as it can get. Configuration software simply isn't.
I still don't really see the point you're trying to make here. There isn't enough detail in the two sentences they gave us on the actual cause of the problem to say much more than I did.
But I guess that just proves my other point. Postmortem was 90% fluff.
Yet in Google's defence, the information they gave was thorough enough for me. My only gripe was how it was being treated here. It just wasn't a very interesting situation and turned out to be something quite mundane.