1. Their bot management system is designed to push a configuration out to their entire network rapidly. This is necessary so they can respond to attacks quickly, but it creates risk compared to systems that roll out changes gradually.
2. Despite the elevated risk of system-wide rapid config propagation, it took them two hours to identify the config as the proximate cause, and another hour to roll it back.
SOP for stuff breaking is that you roll back to a known good state. If you roll out gradually and your canaries break, you have a clear signal to roll back. This was a special case where they needed their system to rapidly propagate changes everywhere, which is a huge risk, but they didn't quite have the visibility and rapid-rollback capability in place to match that risk.
While it’s certainly useful to examine the root cause in the code, you’re never going to have defect free code. Reliability isn’t just about avoiding bugs. It’s about understanding how to give yourself clear visibility into the relationship between changes and behavior and the rollback capability to quickly revert to a known good state.
Cloudflare has done an amazing job with availability for many years and their Rust code now powers 20% of internet traffic. Truly a great team.
How can you write the proxy without handling a config that contains more than the maximum number of features you set yourself?
How can the database export query not have a limit set when there is a hard limit on the number of features? (A sketch of a cheap guard for both cases follows below.)
Why do they make non-critical changes in production before testing them in a staging environment?
Why did they think this was a cyberattack and only after two hours realize it was the config file?
Why are they that afraid of a botnet? Does not leave me confident that they will handle the next Aisuru attack.
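On the first two questions: both sides can carry a cheap guard, and the consuming side can degrade instead of dying. A minimal Rust sketch, assuming a hypothetical limit and a one-feature-per-line file format (nothing like Cloudflare's actual code):

    use std::fs;

    // Hypothetical cap and file format, purely for illustration.
    const MAX_FEATURES: usize = 200;

    #[derive(Debug)]
    enum ConfigError {
        Io(std::io::Error),
        TooManyFeatures { got: usize, max: usize },
    }

    // Refuse an oversized feature file, returning an error the caller can
    // handle by keeping the last good config, instead of panicking when a
    // self-imposed limit is violated.
    fn load_features(path: &str) -> Result<Vec<String>, ConfigError> {
        let raw = fs::read_to_string(path).map_err(ConfigError::Io)?;
        let features: Vec<String> = raw.lines().map(str::to_owned).collect();
        if features.len() > MAX_FEATURES {
            return Err(ConfigError::TooManyFeatures { got: features.len(), max: MAX_FEATURES });
        }
        Ok(features)
    }

The same check, inverted, belongs in the export pipeline (a LIMIT clause or a post-query assertion), so the bad file is never published at all.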
I'm migrating my customers off Cloudflare. I don't think they can swallow the next botnet attacks, and everyone on Cloudflare will go down with the ship, so it will be safer not to be behind Cloudflare when it hits.
That's often the case with human error, as aviation safety experts in particular know: https://en.wikipedia.org/wiki/Swiss_cheese_model
Any big and noticeable incident is one of the "we failed on so many levels here" kind, by definition.
Isn’t getting cyberattacked their core business?
I guess the non-critical change here was the change to the database? My experience has been that a lot of teams do a poor job of keeping a faithful replica of their databases in staging environments to expose this type of issue.
Permissions stuff might be caught without a completely faithful replica, but there are always going to be attributes of the system that only exist in prod.
Yet you fail to acknowledge that the remaining 99.99999% of the logic powering Cloudflare works flawlessly.
Also, hindsight is 20/20
A system that is 99.99999% flawless, can still be unusable.
optimism bias: 100/100
Having an unprivileged application query system.columns to infer the table layout is just bad; not having a proper, well-defined table structure indicates sloppiness in the overall schema design, especially if it changes quickly. Considering ClickHouse specifically, even if this approach were a good idea, the unprivileged way of doing it would be "DESCRIBE TABLE <name>", NOT iterating system.columns. The gist of it: sloppy design, not even well implemented.
Having a critical application issue ad-hoc queries against the system.* tablespace instead of using a well-tested library is just amateurism, and again, bad engineering. IMO it is good practice to treat every application that touches system.* as privileged and to keep that querying completely separate from your application logic. Sometimes system tables change, and fields are added and/or removed; not planning for this will basically make future compatibility a nightmare.
Not only the problematic query itself, but the whole context of this screams "lack of proper application design" and devs not knowing how to use the product and/or read the documentation. Granted, this is a bit close to home for me, because I use ClickHouse extensively (at a scale, I'm assuming, several orders of magnitude smaller than Cloudflare's) and I have spent a lot of time designing specifically to avoid at least some of these kinds of mistakes. But if I can do it at my scale, why aren't they doing it?
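For what it's worth, the isolation described above doesn't need to be heavy. A rough Rust sketch; the query text and table name are my guesses based on the postmortem's description, not Cloudflare's actual code:

    use std::collections::BTreeMap;

    // Keep all system.* introspection behind one small, auditable module,
    // and pin the database explicitly so a permissions change that exposes
    // additional databases cannot silently multiply the returned rows.
    const COLUMNS_QUERY: &str = "SELECT name, type FROM system.columns \
        WHERE database = 'default' AND table = 'http_requests_features' \
        ORDER BY name";

    // Collapse rows into (column -> type): exact duplicates (the same table
    // visible via a second database) dedupe harmlessly, and contradictory
    // ones fail loudly here rather than downstream in the consumer.
    fn schema_from_rows(rows: &[(String, String)]) -> Result<BTreeMap<String, String>, String> {
        let mut schema = BTreeMap::new();
        for (name, ty) in rows {
            if let Some(prev) = schema.insert(name.clone(), ty.clone()) {
                if &prev != ty {
                    return Err(format!("conflicting types for column {name}: {prev} vs {ty}"));
                }
            }
        }
        Ok(schema)
    }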
The database issue screamed "lack of expertise" at me. I don't use ClickHouse, but seeing someone mess with a production system and be surprised ("Oh, it does that?") is really bad. And this is obviously not knowledge that is hard to acquire, buried deep in a manual, or an edge case only discoverable through the source code; it's bread-and-butter knowledge you should have.
What is confusing is that they didn't add this to their follow-up steps. Giving them some benefit of the doubt, I'd assume they didn't want to put something this basic out there as a reason, just to protect the people behind it from widespread blame. But if that's not the case, then it's a general problem. Sadly, it's not uncommon for components like databases to be dealt with on a low-effort basis: just a thing we plug in and it works. But it obviously isn't.
But the case for Cloudflare here is complicated. Every engineer is free to build a better system, though.
Cloudflare builds a global-scale system, not an iPhone app. Please act like it.
They explain that at some length in TFA.
Is that an overreaction?
Name me global, redundant systems that have not (yet) failed.
And if you used Cloudflare to protect against botnets and now move off Cloudflare... you are vulnerable and may experience more downtime if you cannot swallow the traffic yourself.
I mean, no service has 100% uptime; it's just that some have more nines than others.
I do like Cloudflare's flat cost and feature set better, but they have quite a few outages compared to other large vendors, especially with Access (their Zero Trust product).
I'd lump them into GitHub levels of reliability
We had a comparable but slightly higher quote from an Akamai VAR.
But at the same time, what value do they add if they:
* Took down their customers' sites due to their own bug.
* Never protected against an attack that our infra could not have handled by itself.
* Don't think that they will be able to handle the "next big ddos" attack.
It's just an extra layer of complexity for us. I'm sure there are attacks they could help our customers with; that's why we're using them in the first place. But until our customers are hit with multiple DDoS attacks that we cannot handle ourselves, it's just not worth it.
It's not that many things had to fail; it's that many obvious things haven't been done. It would be a valid excuse if many "exotic" scenarios had to align, not when it's obvious error cases that weren't handled and changes that weren't tested.
While having wrong first assumptions is just how things work when you try to analyze the issue[1], not testing changes before production is just stupidity and nothing else.
The story would be different if, e.g., multiple unlikely, hard-to-track things had happened at once without any clearly linkable event, something that could just as easily have slipped past staging. Most of the things mentioned could essentially have been statically checked. This is the prime example of what you want as a tech person, because it's not hard to prevent, compared to the many scenarios where you're balancing likelihoods of scenarios, timings, etc.
You don't think someone is a great plumber who forgot their tools, missed that big hole in the pipe, and also rang at the wrong door, just because all these things failed together. You think someone is a good plumber if they say they have to go back to fetch a bulky specialized tool because this is the rare case in which they need it, or that they could also do this other thing in this specific case. They are great plumbers if they tell you how this happened in the first place and how to fix it. They are great plumbers if they manage to fix something outside of their usual scope.
Here pretty much all of the things that you pay them for failed. At a large scale.
I am sure there are reasons we don't know about, and I hope that Cloudflare can fix them. Be it management focusing on the wrong things, be it developers being in the wrong position or not caring enough, or something else entirely. However, not doing these things is (likely) a sign that they are currently not in a state to create reliable systems, at least none reliable enough for what they are doing. It would be perfectly fine if they ran a web shop or something, but when, as we've just seen, many other companies rely on you being up or their stuff fails, then maybe you should not run a company with products like "Always Online".
[1] And it should make you adapt your process for analyzing issues, e.g. making sure config changes are "very loud" in monitoring. They're one of the most easily tracked things that can go wrong, and can relatively easily be mapped to a point in time compared to many other causes.
I don’t really buy this requirement. At least make it configurable with a more reasonable default for “routine” changes. E.g. ramping to 100% over 1 hour.
As long as that ramp rate is configurable, you can retain the ability to respond fast to attacks by setting the ramp time to a few seconds if you truly think it’s needed in that moment.
https://developers.cloudflare.com/bots/get-started/bot-manag...
Of course, this is all so easy to say after the fact...
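Still, the ramp itself is barely any code. A sketch of deterministic, time-based bucketing; all names and parameters here are invented for illustration:

    use std::collections::hash_map::DefaultHasher;
    use std::hash::{Hash, Hasher};
    use std::time::Duration;

    // Decide whether a given machine should pick up a new config version
    // yet, ramping from 0% to 100% of the fleet over `ramp`. For an active
    // attack the operator can set `ramp` to a few seconds.
    fn should_apply(machine_id: &str, config_version: u64, elapsed: Duration, ramp: Duration) -> bool {
        let fraction = (elapsed.as_secs_f64() / ramp.as_secs_f64()).min(1.0);
        // Stable per-version bucket in [0, 1) so the same machines go first.
        let mut h = DefaultHasher::new();
        (machine_id, config_version).hash(&mut h);
        let bucket = (h.finish() % 10_000) as f64 / 10_000.0;
        bucket < fraction
    }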
"To make error is human. To propagate error to all server in automatic way is #devops"
This saying dates back to 1969: "To err is human but to really foul things up requires a computer."
* https://quoteinvestigator.com/2010/12/07/foul-computer/
Also: I know there’s a proverb which says ‘To err is human,’ but a human error is nothing to what a computer can do if it tries.
Once every 5m is not "rapidly". It isn't uncommon for configuration systems to do it every few seconds [0].
> While it’s certainly useful to examine the root cause in the code.
I believe the issue is as much that the output of a periodic run (the ClickHouse query) was altered by what was, on the surface, an unrelated change, causing this failure. That is, the system that validated the configuration (FL2) was different from the one that generated it (the ML Bot Management DB).
Ideally, the system that vends a complex configuration also vends and tests the library that consumes it; or else the system that consumes it does so as if "tasting" the configuration first, rather than devouring it unconditionally [1] (see the sketch below).
Of course, as with all distributed-systems failures, this is all easy to say in hindsight.
[0] Avoiding overload in distributed systems by putting the smaller service in control (pg 4), https://d1.awsstatic.com/builderslibrary/pdfs/Avoiding%20ove...
[1] Lessons from CloudFront (2016), https://youtube.com/watch?v=n8qQGLJeUYA&t=1050
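And the "tasting" in [1] can be as small as validating a candidate config and atomically swapping it in only on success, keeping the last-known-good otherwise. A hypothetical Rust sketch (limits and format invented):

    use std::sync::{Arc, RwLock};

    struct BotConfig { features: Vec<String> }

    // Handle the request path reads from; it is only ever replaced by a
    // config that already passed validation, so a bad file can never take
    // down the serving path.
    type Live = Arc<RwLock<Arc<BotConfig>>>;

    fn taste(raw: &str) -> Result<BotConfig, String> {
        let features: Vec<String> = raw.lines().map(str::to_owned).collect();
        if features.is_empty() || features.len() > 200 {
            return Err(format!("implausible feature count: {}", features.len()));
        }
        Ok(BotConfig { features })
    }

    fn try_update(live: &Live, raw_new: &str) {
        match taste(raw_new) {
            Ok(fresh) => {
                if let Ok(mut guard) = live.write() {
                    *guard = Arc::new(fresh); // adopt only after it passes
                }
            }
            Err(e) => eprintln!("kept last-known-good config: {e}"),
        }
    }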
Isn't "rapidly" more about how long it takes to get from A to Z than about how often it is performed? You can push out a configuration update every fortnight, but if it goes through all of your global servers in three seconds, I'd call it quite rapid.
Let those who have never written a bug before cast the first stone.
Reminds me of A House of Dynamite, the movie about nuclear apocalypse that really revolves around these very human factors. This outage is a perfect example of why relying on anything humans have built is risky, which includes the entire nuclear apparatus. "I don't understand why X wasn't built in such a way that wouldn't mean we live in an underground bunker now" is the sentence that comes to mind.
I guess you are right, likely a social issue, but certainly not a single exhausted parent.
The new config file was not (AIUI) invalid (syntax-wise) but rather too big:
> […] That feature file, in turn, doubled in size. The larger-than-expected feature file was then propagated to all the machines that make up our network.
> The software running on these machines to route traffic across our network reads this feature file to keep our Bot Management system up to date with ever changing threats. The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail.
In fact, the root bug (faulty assumption?) was in one or more SQL catalog queries that were presumably written some time ago.
(Interestingly, the analysis doesn't go into how these erroneous queries made it into production, OR whether the assumption was "to spec" and it's the security-principal change work that was faulty. The former seems more likely.)
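Either way, the size mismatch suggests the generating pipeline never knew about the consumer's hard limit. A hedged sketch of enforcing a shared limit at export time; the constant and function names are hypothetical:

    // A limit shared (e.g. via a common crate) by both the exporter and the
    // proxy, so a feature file larger than what the consumer accepts can
    // never be published in the first place.
    pub const MAX_FEATURE_FILE_BYTES: usize = 1 << 20;

    fn publish_feature_file(serialized: &[u8]) -> Result<(), String> {
        if serialized.len() > MAX_FEATURE_FILE_BYTES {
            return Err(format!(
                "refusing to publish feature file: {} bytes exceeds limit of {}",
                serialized.len(),
                MAX_FEATURE_FILE_BYTES
            ));
        }
        // ...hand off to the distribution pipeline here...
        Ok(())
    }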
If this was a routine config change, I could see how it could take two hours to start the remediation plan. However, they should have dashboards that correlate config changes with 500 errors (or equivalent). It gets difficult when you have many of these going out at the same time and they are slowly rolled out.
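Concretely, that correlation only requires every config push to emit a machine-readable event the dashboards can overlay on the error-rate graphs. A hypothetical sketch (the event format is made up):

    use std::time::{SystemTime, UNIX_EPOCH};

    // Emit one structured line per config propagation; a dashboard can then
    // draw these as vertical markers over the 500-rate graph.
    fn log_config_push(system: &str, version: u64) {
        let ts = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .map(|d| d.as_secs())
            .unwrap_or(0);
        println!("{{\"event\":\"config_push\",\"system\":\"{system}\",\"version\":{version},\"ts\":{ts}}}");
    }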
The root-cause document is mostly for high-level readers and the public. The details of this specific outage will be in an internal document with many action items, some of them maybe quarter-long projects, including fixing this specific bug and maybe adding a linter/monitor to prevent it from happening again.
In a productive way, this view also shifts the focus to improving the system (visibility etc.) and empowering the team, rather than focusing on the code that broke (which probably strikes fear into the individuals involved, deterring them from doing anything!).
I'm sure there are misapplied guidelines behind doing that instead of being lenient with incoming bot management configuration files, and someone might have been scolded (or worse) for proposing or attempting to handle them more safely.
If every time there's a new bot someone needs to write code that can blow up their whole service, maybe they need to iterate a bit on this design?
That, and why the hell wasn't their alerting surfacing the colossal number of panics in their bot management component?
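Panics are at least easy to make loud: Rust lets you hook them process-wide. A minimal sketch (how Cloudflare actually wires its alerting is not public):

    use std::panic;
    use std::sync::atomic::{AtomicU64, Ordering};

    static PANIC_COUNT: AtomicU64 = AtomicU64::new(0);

    // Install once at startup: every panic bumps a counter that a metrics
    // scraper can alert on, while the message and location still reach logs.
    fn install_panic_alerting() {
        let default_hook = panic::take_hook();
        panic::set_hook(Box::new(move |info| {
            PANIC_COUNT.fetch_add(1, Ordering::Relaxed);
            eprintln!("panic: {info}"); // ship this line to the alerting pipeline
            default_hook(info);
        }));
    }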
This is also a pretty good example why having stack traces by default is great. That error could have been immediately understood just from a stack trace and a basic exception message.
And yes, there is a lint you can use against slicing ('indexing_slicing') and it's absolutely wild that it's not on by default in clippy.
    [lints.clippy]
    dbg_macro = "deny"
    unwrap_used = "deny"
    expect_used = "deny"

This is sobering.
My new fear is some dependency unwrap()ing or expect()ing something where they didn't prove the correctness.
unwrap() and expect() are an anti-pattern and have no place in idiomatic Rust code. The language should move to deprecate them.
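For context, the shape those lints push you toward looks like this; a generic sketch, not the actual Cloudflare code:

    // What clippy::indexing_slicing flags: panics on a bad index.
    fn risky(features: &[f64], i: usize) -> f64 {
        features[i]
    }

    // The alternative: the failure becomes a value the caller must handle,
    // so one bad input can't abort the whole proxy.
    fn safe(features: &[f64], i: usize) -> Result<f64, String> {
        features
            .get(i)
            .copied()
            .ok_or_else(|| format!("feature index {i} out of bounds (len {})", features.len()))
    }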
As the user, I can't tell the difference, but it might have sped up their recovery a bit.
I imagine it would also require less time debugging a panic. That kind of breadcrumb trail in your logs is a gift to the future engineer and also customers who see a shorter period of downtime.
Both are important, and I am pretty sure someone is going to fix that line of code pretty soon.
This is the danger of automated control systems. If they get hacked or somehow push out bad things (CrowdStrike), they will have complete control and be very efficient.