If this were sent out at AWS as a COE (postmortem), it would be ripped apart - it does nothing to convince anyone reading it that this class of failure isn't going to happen again. It looks like they haven't even identified the root cause(s) of the failure...
> Efforts continue to find the incompatibility in the networking configuration change. Additionally, we are exploring improvements to our tools and processes to facilitate a finer grained, more incremental deployment method for wide, system-level changes.
I would have liked to hear more about how they are going to reduce the blast radius of a change like this, because it sounds like something that could have been deployed to a single datacenter first.
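To make that concrete, what I'd expect from a "finer grained, more incremental deployment method" is a staged rollout that only widens scope after a bake period and a health check. A rough sketch of the idea - the stage names, bake time, and health check here are all made up, not anything described in the post:

```python
# Hypothetical staged rollout: push a config change one scope at a time
# and stop widening the blast radius on any health regression.
import time

STAGES = ["canary-hosts", "single-datacenter", "remaining-datacenters"]

def apply_change(stage: str) -> None:
    """Placeholder for whatever actually deploys the networking config change."""
    print(f"applying networking config change to {stage}")

def healthy(stage: str) -> bool:
    """Placeholder: check packet loss / volume attach latency for this stage."""
    return True

def rollout() -> None:
    for stage in STAGES:
        apply_change(stage)
        time.sleep(60)  # bake time before touching the next, larger scope
        if not healthy(stage):
            print(f"regression detected in {stage}; halting rollout")
            return
    print("rollout complete")

if __name__ == "__main__":
    rollout()
```

Even something this crude would have confined the incompatibility to the first stage instead of every Block Storage cluster at once.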
Trust and transparency are the currencies of the internet, in the same way that cigarettes and contraband are the currencies in prison. This post is worth approximately a half-smoked cigarette.
> The outage was triggered as a result of a networking configuration change on the Block Storage clusters to improve handling packet loss scenarios. The new setting caused incompatibilities
So that doesn't tell us very much about the cause ("a networking configuration change") nor the effect ("incompatibilities").
I wonder if this just means someone changed an MTU setting and it led to tons of fragmentation across different components of the network, with large transfers in particular timing out constantly until the whole thing became an outage. Just a wild guess, but I've seen this happen in in-house datacenters before, so perhaps.
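If that guess were right, one way to spot it is to probe the path MTU with don't-fragment pings and see whether it shrank after the change. A minimal sketch, assuming Linux iputils ping and a reachable target - the host name and size bounds are placeholders, not anything from the post:

```python
# Hypothetical path MTU probe: binary-search the largest ICMP payload that
# passes with the don't-fragment bit set (ping -M do), then add 28 bytes
# for the IPv4 + ICMP headers to get the path MTU.
import subprocess

def ping_df(host: str, payload: int) -> bool:
    """Return True if a single don't-fragment ping of this payload size succeeds."""
    result = subprocess.run(
        ["ping", "-M", "do", "-s", str(payload), "-c", "1", "-W", "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def probe_path_mtu(host: str, lo: int = 1200, hi: int = 9000) -> int:
    """Binary-search the largest working payload; path MTU = payload + 28."""
    best = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        if ping_df(host, mid):
            best = mid
            lo = mid + 1
        else:
            hi = mid - 1
    return best + 28

if __name__ == "__main__":
    print(probe_path_mtu("block-storage.example.internal"))
```

The nasty part of an MTU mismatch is exactly what the post hints at: small health-check packets keep passing, so everything looks fine, while large block-storage writes silently black-hole and time out.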