The key observations:
"Due to Facebook stopping announcing their DNS prefix routes through BGP, our and everyone else's DNS resolvers had no way to connect to their nameservers. Consequently, 1.1.1.1, 8.8.8.8, and other major public DNS resolvers started issuing (and caching) SERVFAIL responses.
But that's not all. Now human behavior and application logic kicks in and causes another exponential effect. A tsunami of additional DNS traffic follows.
This happened in part because apps won't accept an error for an answer and start retrying, sometimes aggressively, and in part because end-users also won't take an error for an answer and start reloading the pages, or killing and relaunching their apps, sometimes also aggressively."
I'm certainly guilty of this. Retries make the world go round, and round again. I've been given attitude by teams that own downstream services.
Them: "Why are you retrying so aggressively?" Me: "Why is your service so damn flakey?"
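For what it's worth, the usual truce between those two positions is capped exponential backoff with jitter, so a flaky dependency still gets retried but a dead one doesn't get hammered in lockstep. A minimal sketch, not anyone's production client: `backoff_delays`, `call_with_retries`, and the `base`/`cap` defaults are names and values made up for illustration.

```python
import random
import time


def backoff_delays(retries, base=0.5, cap=30.0, rng=random.random):
    """Yield one sleep duration per retry: capped exponential backoff, full jitter."""
    for attempt in range(retries):
        # The ceiling doubles on each retry but never exceeds `cap` seconds.
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: pick a random point in [0, ceiling) so clients spread out.
        yield rng() * ceiling


def call_with_retries(fn, retries=5, base=0.5, cap=30.0, sleep=time.sleep):
    """Call fn(); on failure wait a jittered delay and retry, at most `retries` times."""
    delays = backoff_delays(retries, base, cap)
    while True:
        try:
            return fn()
        except Exception:
            delay = next(delays, None)
            if delay is None:
                # Retry budget exhausted: surface the last error to the caller.
                raise
            sleep(delay)
```

Usage would be something like `call_with_retries(lambda: resolve("example.com"))`, where `resolve` stands in for whatever flaky call you are wrapping. The jitter matters as much as the backoff: without it, every client that failed at the same moment retries at the same moment.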
Depends on the rate, I would think:
(And that sounds like you giving, rather than being given, attitude.)
-some tester I know
[1] https://engineering.fb.com/2021/05/13/data-center-engineerin...
[2] https://www.usenix.org/conference/nsdi21/presentation/abhash...
- someone big redistributed their static routes for FB into their announcements to peers.
- someone who has mapped peer filters and their prefix lengths has figured out how to announce smaller prefixes for FB routes and have them propagate.
- someone with enable somewhere in one of the major ASNs (like 701 back in my day etc) is doing a straightforward attack on FB.
- someone inside FB messed with load balancing and prepended a bunch of their routes internally and redistributed the long AS paths themselves and just broke shit with internal routing loops.
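The "smaller prefixes" scenario in the list above works because routers prefer the longest matching prefix, so a more-specific announcement that slips under a peer's maximum-length filter wins over the legitimate covering route. A toy sketch of that selection logic, assuming a made-up filter limit of /24, private address space, and documentation ASNs as placeholders (none of these are real FB prefixes or real filter policies):

```python
import ipaddress


def accept(prefix, max_len=24):
    """Hypothetical peer filter: drop anything more specific than max_len."""
    return prefix.prefixlen <= max_len


def best_route(table, address):
    """Return (prefix, origin) for the longest matching prefix, or None."""
    addr = ipaddress.ip_address(address)
    matches = [(prefix, origin) for prefix, origin in table if addr in prefix]
    # Longest-prefix match: the most specific covering route wins.
    return max(matches, key=lambda m: m[0].prefixlen, default=None)


# Placeholder announcements; origins are illustrative labels only.
announcements = [
    (ipaddress.ip_network("10.20.0.0/16"), "AS64500 (legitimate)"),
    (ipaddress.ip_network("10.20.30.0/24"), "AS64501 (more specific)"),
    (ipaddress.ip_network("10.20.30.128/25"), "AS64501 (too specific)"),
]

# Only announcements that pass the filter make it into the routing table.
table = [(prefix, origin) for prefix, origin in announcements if accept(prefix)]
```

Here `best_route(table, "10.20.30.200")` picks the /24 from the second origin even though the /16 also covers it, while the /25 never propagates because the filter drops it. Real BGP best-path selection has many more tie-breakers; this only shows why prefix length and filter limits matter.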
I have no idea how people unbefunge routing problems now that you have to coordinate multiple teams on the phone to get anything done, instead of one router guru just logging into everything and fixing it. I would be useless at it now, but this is not a recent problem. If it's still a problem, it will always be a problem.
Excellent. Just what I like to hear /s