1. Their bot management system is designed to push a configuration out to their entire network rapidly. This is necessary so they can respond to attacks quickly, but it creates risk compared to systems that roll out changes gradually.
2. Despite the elevated risk of system-wide rapid config propagation, it took them two hours to identify the config as the proximate cause, and another hour to roll it back.
SOP for stuff breaking is that you roll back to a known good state. If you roll out gradually and your canaries break, you have a clear signal to roll back. This was a special case where they needed their system to rapidly propagate changes everywhere, which is a huge risk, but they didn't quite have the visibility and rapid-rollback capability in place to match that risk.
While it’s certainly useful to examine the root cause in the code, you’re never going to have defect free code. Reliability isn’t just about avoiding bugs. It’s about understanding how to give yourself clear visibility into the relationship between changes and behavior and the rollback capability to quickly revert to a known good state.
Cloudflare has done an amazing job with availability for many years and their Rust code now powers 20% of internet traffic. Truly a great team.
How can you write the proxy without handling a config that contains more than the maximum number of features you set yourself?
How can the database export query not have a limit set when there is a hard limit on the number of features? (A sketch of a cheap guard for both cases follows below.)
Why do they make non-critical changes in production before testing them in a staging environment?
Why did they think this was a cyberattack and only after two hours realize it was the config file?
Why are they that afraid of a botnet? Does not leave me confident that they will handle the next Aisuru attack.
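On the first two questions: both sides can carry a cheap guard, and the consuming side can degrade instead of dying. A minimal Rust sketch, assuming a hypothetical limit and a one-feature-per-line file format (nothing like Cloudflare's actual code):

    use std::fs;

    // Hypothetical cap and file format, purely for illustration.
    const MAX_FEATURES: usize = 200;

    #[derive(Debug)]
    enum ConfigError {
        Io(std::io::Error),
        TooManyFeatures { got: usize, max: usize },
    }

    // Refuse an oversized feature file, returning an error the caller can
    // handle by keeping the last good config, instead of panicking when a
    // self-imposed limit is violated.
    fn load_features(path: &str) -> Result<Vec<String>, ConfigError> {
        let raw = fs::read_to_string(path).map_err(ConfigError::Io)?;
        let features: Vec<String> = raw.lines().map(str::to_owned).collect();
        if features.len() > MAX_FEATURES {
            return Err(ConfigError::TooManyFeatures { got: features.len(), max: MAX_FEATURES });
        }
        Ok(features)
    }

The same check, inverted, belongs in the export pipeline (a LIMIT clause or a post-query assertion), so the bad file is never published at all.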
I'm migrating my customers off Cloudflare. I don't think they can swallow the next botnet attacks, and everyone on Cloudflare will go down with the ship, so it will be safer not to be behind Cloudflare when it hits.
That's often the case with human error, as aviation safety experts in particular know: https://en.wikipedia.org/wiki/Swiss_cheese_model
Any big and noticeable incident is one of the "we failed on so many levels here" kind, by definition.
Isn’t getting cyberattacked their core business?
I guess the non-critical change here was the change to the database? My experience has been that a lot of teams do a poor job of keeping a faithful replica of their databases in staging environments to expose this type of issue.
Permissions stuff might be caught without a completely faithful replica, but there are always going to be attributes of the system that only exist in prod.
Yet you fail to acknowledge that the remaining 99.99999% of the logic powering Cloudflare works flawlessly.
Also, hindsight is 20/20
A system that is 99.99999% flawless, can still be unusable.
optimism bias: 100/100
Having an unprivileged application query system.columns to infer the table layout is just bad; not having a proper, well-defined table structure indicates sloppiness in the overall schema design, especially if it changes quickly. Considering ClickHouse specifically, even if this approach were a good idea, the unprivileged way of doing it would be "DESCRIBE TABLE <name>", NOT iterating system.columns. The gist of it: sloppy design, not even well implemented.
Having a critical application issue ad-hoc queries against the system.* tablespace instead of using a well-tested library is just amateurism, and again, bad engineering. IMO it is good practice to treat every application that touches system.* as privileged and to keep that querying completely separate from your application logic. Sometimes system tables change, and fields are added and/or removed; not planning for this will basically make future compatibility a nightmare.
Not only the problematic query itself, but the whole context of this screams "lack of proper application design" and devs not knowing how to use the product and/or read the documentation. Granted, this is a bit close to home for me, because I use ClickHouse extensively (at a scale, I'm assuming, several orders of magnitude smaller than Cloudflare's) and I have spent a lot of time designing specifically to avoid at least some of these kinds of mistakes. But if I can do it at my scale, why aren't they doing it?
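For what it's worth, the isolation described above doesn't need to be heavy. A rough Rust sketch; the query text and table name are my guesses based on the postmortem's description, not Cloudflare's actual code:

    use std::collections::BTreeMap;

    // Keep all system.* introspection behind one small, auditable module,
    // and pin the database explicitly so a permissions change that exposes
    // additional databases cannot silently multiply the returned rows.
    const COLUMNS_QUERY: &str = "SELECT name, type FROM system.columns \
        WHERE database = 'default' AND table = 'http_requests_features' \
        ORDER BY name";

    // Collapse rows into (column -> type): exact duplicates (the same table
    // visible via a second database) dedupe harmlessly, and contradictory
    // ones fail loudly here rather than downstream in the consumer.
    fn schema_from_rows(rows: &[(String, String)]) -> Result<BTreeMap<String, String>, String> {
        let mut schema = BTreeMap::new();
        for (name, ty) in rows {
            if let Some(prev) = schema.insert(name.clone(), ty.clone()) {
                if &prev != ty {
                    return Err(format!("conflicting types for column {name}: {prev} vs {ty}"));
                }
            }
        }
        Ok(schema)
    }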
The database issue screamed "lack of expertise" at me. I don't use ClickHouse, but seeing someone mess with a production system and be surprised ("Oh, it does that?") is really bad. And this is obviously not knowledge that is hard to acquire, buried deep in a manual, or an edge case only discoverable through the source code; it's bread-and-butter knowledge you should have.
What is confusing is that they didn't add this to their follow-up steps. Giving them some benefit of the doubt, I'd assume they didn't want to put something this basic out there as a reason, just to protect the people behind it from widespread blame. But if that's not the case, then it's a general problem. Sadly, it's not uncommon for components like databases to be dealt with on a low-effort basis: just a thing we plug in and it works. But it obviously isn't.
But the case for Cloudflare here is complicated. Every engineer is free to build a better system, though.
Cloudflare builds a global-scale system, not an iPhone app. Please act like it.
They explain that at some length in TFA.
Is that an overreaction?
Name me global, redundant systems that have not (yet) failed.
And if you used Cloudflare to protect against botnets and now move off Cloudflare... you are vulnerable and may experience more downtime if you cannot swallow the traffic yourself.
I mean, no service has 100% uptime; it's just that some have more nines than others.
I do like Cloudflare's flat cost and feature set better, but they have quite a few outages compared to other large vendors, especially with Access (their Zero Trust product).
I'd lump them into GitHub levels of reliability
We had a comparable but slightly higher quote from an Akamai VAR.
But at the same time, what value do they add if they:
* Took down their customers' sites due to their own bug.
* Never protected against an attack that our infra could not have handled by itself.
* Don't think that they will be able to handle the "next big ddos" attack.
It's just an extra layer of complexity for us. I'm sure there are attacks they could help our customers with; that's why we're using them in the first place. But until our customers are hit with multiple DDoS attacks that we cannot handle ourselves, it's just not worth it.
It's not that many things had to fail; it's that many obvious things haven't been done. It would be a valid excuse if many "exotic" scenarios had to align, not when it's obvious error cases that weren't handled and changes that weren't tested.
While having wrong first assumptions is just how things work when you try to analyze the issue[1], not testing changes before production is just stupidity and nothing else.
The story would be different if, e.g., multiple unlikely, hard-to-track things had happened at once without any clearly linkable event, something that could just as easily have slipped past staging. Most of the things mentioned could essentially have been statically checked. This is the prime example of what you want as a tech person, because it's not hard to prevent, compared to the many scenarios where you're balancing likelihoods of scenarios, timings, etc.
You don't think someone is a great plumber who forgot their tools, missed that big hole in the pipe, and also rang at the wrong door, just because all these things failed together. You think someone is a good plumber if they say they have to go back to fetch a bulky specialized tool because this is the rare case in which they need it, or that they could also do this other thing in this specific case. They are great plumbers if they tell you how this happened in the first place and how to fix it. They are great plumbers if they manage to fix something outside of their usual scope.
Here pretty much all of the things that you pay them for failed. At a large scale.
I am sure there are reasons we don't know about, and I hope that Cloudflare can fix them. Be it management focusing on the wrong things, be it developers being in the wrong position or not caring enough, or something else entirely. However, not doing these things is (likely) a sign that they are currently not in a state to create reliable systems, at least none reliable enough for what they are doing. It would be perfectly fine if they ran a web shop or something, but when, as we've just seen, many other companies rely on you being up or their stuff fails, then maybe you should not run a company with products like "Always Online".
[1] And it should make you adapt your process for analyzing issues, e.g. making sure config changes are "very loud" in monitoring. They're one of the most easily tracked things that can go wrong, and can relatively easily be mapped to a point in time compared to many other causes.
I don’t really buy this requirement. At least make it configurable with a more reasonable default for “routine” changes. E.g. ramping to 100% over 1 hour.
As long as that ramp rate is configurable, you can retain the ability to respond fast to attacks by setting the ramp time to a few seconds if you truly think it’s needed in that moment.
https://developers.cloudflare.com/bots/get-started/bot-manag...
Of course, this is all so easy to say after the fact...
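Still, the ramp itself is barely any code. A sketch of deterministic, time-based bucketing; all names and parameters here are invented for illustration:

    use std::collections::hash_map::DefaultHasher;
    use std::hash::{Hash, Hasher};
    use std::time::Duration;

    // Decide whether a given machine should pick up a new config version
    // yet, ramping from 0% to 100% of the fleet over `ramp`. For an active
    // attack the operator can set `ramp` to a few seconds.
    fn should_apply(machine_id: &str, config_version: u64, elapsed: Duration, ramp: Duration) -> bool {
        let fraction = (elapsed.as_secs_f64() / ramp.as_secs_f64()).min(1.0);
        // Stable per-version bucket in [0, 1) so the same machines go first.
        let mut h = DefaultHasher::new();
        (machine_id, config_version).hash(&mut h);
        let bucket = (h.finish() % 10_000) as f64 / 10_000.0;
        bucket < fraction
    }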
"To make error is human. To propagate error to all server in automatic way is #devops"
This saying dates back to 1969: "To err is human but to really foul things up requires a computer."
* https://quoteinvestigator.com/2010/12/07/foul-computer/
Also: I know there’s a proverb which says ‘To err is human,’ but a human error is nothing to what a computer can do if it tries.
Once every 5m is not "rapidly". It isn't uncommon for configuration systems to do it every few seconds [0].
> While it’s certainly useful to examine the root cause in the code.
I believe the issue is as much that the output of a periodic run (the ClickHouse query) was altered by what was, on the surface, an unrelated change, causing this failure. That is, the system that validated the configuration (FL2) was different from the one that generated it (the ML Bot Management DB).
Ideally, the system that vends a complex configuration also vends and tests the library that consumes it; or else the system that consumes it does so as if "tasting" the configuration first, rather than devouring it unconditionally [1] (see the sketch below).
Of course, as with all distributed-systems failures, this is all easy to say in hindsight.
[0] Avoiding overload in distributed systems by putting the smaller service in control (pg 4), https://d1.awsstatic.com/builderslibrary/pdfs/Avoiding%20ove...
[1] Lessons from CloudFront (2016), https://youtube.com/watch?v=n8qQGLJeUYA&t=1050
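And the "tasting" in [1] can be as small as validating a candidate config and atomically swapping it in only on success, keeping the last-known-good otherwise. A hypothetical Rust sketch (limits and format invented):

    use std::sync::{Arc, RwLock};

    struct BotConfig { features: Vec<String> }

    // Handle the request path reads from; it is only ever replaced by a
    // config that already passed validation, so a bad file can never take
    // down the serving path.
    type Live = Arc<RwLock<Arc<BotConfig>>>;

    fn taste(raw: &str) -> Result<BotConfig, String> {
        let features: Vec<String> = raw.lines().map(str::to_owned).collect();
        if features.is_empty() || features.len() > 200 {
            return Err(format!("implausible feature count: {}", features.len()));
        }
        Ok(BotConfig { features })
    }

    fn try_update(live: &Live, raw_new: &str) {
        match taste(raw_new) {
            Ok(fresh) => {
                if let Ok(mut guard) = live.write() {
                    *guard = Arc::new(fresh); // adopt only after it passes
                }
            }
            Err(e) => eprintln!("kept last-known-good config: {e}"),
        }
    }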
Isn't "rapidly" more about how long it takes to get from A to Z than about how often it is performed? You can push out a configuration update every fortnight, but if it goes through all of your global servers in three seconds, I'd call it quite rapid.
Let those who have never written a bug before cast the first stone.
Reminds me of A House of Dynamite, the movie about nuclear apocalypse that really revolves around these very human factors. This outage is a perfect example of why relying on anything humans have built is risky, which includes the entire nuclear apparatus. "I don't understand why X wasn't built in such a way that wouldn't mean we live in an underground bunker now" is the sentence that comes to mind.
I guess you are right, likely a social issue, but certainly not a single exhausted parent.
The new config file was not (AIUI) invalid (syntax-wise) but rather too big:
> […] That feature file, in turn, doubled in size. The larger-than-expected feature file was then propagated to all the machines that make up our network.
> The software running on these machines to route traffic across our network reads this feature file to keep our Bot Management system up to date with ever changing threats. The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail.
In fact, the root bug (faulty assumption?) was in one or more SQL catalog queries that were presumably written some time ago.
(Interestingly, the analysis doesn't go into how these erroneous queries made it into production, OR whether the assumption was "to spec" and it's the security-principal change work that was faulty. The former seems more likely.)
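Either way, the size mismatch suggests the generating pipeline never knew about the consumer's hard limit. A hedged sketch of enforcing a shared limit at export time; the constant and function names are hypothetical:

    // A limit shared (e.g. via a common crate) by both the exporter and the
    // proxy, so a feature file larger than what the consumer accepts can
    // never be published in the first place.
    pub const MAX_FEATURE_FILE_BYTES: usize = 1 << 20;

    fn publish_feature_file(serialized: &[u8]) -> Result<(), String> {
        if serialized.len() > MAX_FEATURE_FILE_BYTES {
            return Err(format!(
                "refusing to publish feature file: {} bytes exceeds limit of {}",
                serialized.len(),
                MAX_FEATURE_FILE_BYTES
            ));
        }
        // ...hand off to the distribution pipeline here...
        Ok(())
    }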
If this was a routine config change, I could see how it could take two hours to start the remediation plan. However, they should have dashboards that correlate config changes with 500 errors (or equivalent). It gets difficult when you have many of these going out at the same time and they are slowly rolled out.
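Concretely, that correlation only requires every config push to emit a machine-readable event the dashboards can overlay on the error-rate graphs. A hypothetical sketch (the event format is made up):

    use std::time::{SystemTime, UNIX_EPOCH};

    // Emit one structured line per config propagation; a dashboard can then
    // draw these as vertical markers over the 500-rate graph.
    fn log_config_push(system: &str, version: u64) {
        let ts = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .map(|d| d.as_secs())
            .unwrap_or(0);
        println!("{{\"event\":\"config_push\",\"system\":\"{system}\",\"version\":{version},\"ts\":{ts}}}");
    }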
The root-cause document is mostly for high-level readers and the public. The details of this specific outage will be in an internal document with many action items, some of them maybe quarter-long projects, including fixing this specific bug and maybe adding a linter/monitor to prevent it from happening again.
In a productive way, this view also shifts the focus to improving the system (visibility etc.) and empowering the team, rather than focusing on the code that broke (which probably strikes fear into the individuals involved, deterring them from doing anything!).
I'm sure there are misapplied guidelines behind doing that instead of being lenient with incoming bot management configuration files, and someone might have been scolded (or worse) for proposing or attempting to handle them more safely.
If every time there's a new bot someone needs to write code that can blow up their whole service, maybe they need to iterate a bit on this design?
That, and why the hell wasn't their alerting surfacing the colossal number of panics in their bot management component?
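Panics are at least easy to make loud: Rust lets you hook them process-wide. A minimal sketch (how Cloudflare actually wires its alerting is not public):

    use std::panic;
    use std::sync::atomic::{AtomicU64, Ordering};

    static PANIC_COUNT: AtomicU64 = AtomicU64::new(0);

    // Install once at startup: every panic bumps a counter that a metrics
    // scraper can alert on, while the message and location still reach logs.
    fn install_panic_alerting() {
        let default_hook = panic::take_hook();
        panic::set_hook(Box::new(move |info| {
            PANIC_COUNT.fetch_add(1, Ordering::Relaxed);
            eprintln!("panic: {info}"); // ship this line to the alerting pipeline
            default_hook(info);
        }));
    }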
This is also a pretty good example why having stack traces by default is great. That error could have been immediately understood just from a stack trace and a basic exception message.
And yes, there is a lint you can use against slicing ('indexing_slicing') and it's absolutely wild that it's not on by default in clippy.
    [lints.clippy]
    dbg_macro = "deny"
    unwrap_used = "deny"
    expect_used = "deny"

This is sobering.
My new fear is some dependency unwrap()ing or expect()ing something where they didn't prove the correctness.
unwrap() and expect() are an anti-pattern and have no place in idiomatic Rust code. The language should move to deprecate them.
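For context, the shape those lints push you toward looks like this; a generic sketch, not the actual Cloudflare code:

    // What clippy::indexing_slicing flags: panics on a bad index.
    fn risky(features: &[f64], i: usize) -> f64 {
        features[i]
    }

    // The alternative: the failure becomes a value the caller must handle,
    // so one bad input can't abort the whole proxy.
    fn safe(features: &[f64], i: usize) -> Result<f64, String> {
        features
            .get(i)
            .copied()
            .ok_or_else(|| format!("feature index {i} out of bounds (len {})", features.len()))
    }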
As the user, I can't tell the difference, but it might have sped up their recovery a bit.
I imagine it would also require less time debugging a panic. That kind of breadcrumb trail in your logs is a gift to the future engineer and also customers who see a shorter period of downtime.
Both are important, and I am pretty sure someone is going to fix that line of code pretty soon.
This is the danger of automated control systems. If they get hacked or somehow push out bad things (CrowdStrike), they will have complete control and be very efficient.