Cloudflare outage on February 20, 2026

(blog.cloudflare.com)

190 points · nomaxx117 · 3mo ago · 125 comments


kgeist3mo ago

It's something we debated in our team: if there's an API that returns data based on filters, what's the better behavior if no filters are provided - return everything or return nothing?

The consensus was that returning everything is rarely what's desired, for two reasons: first, if the system grows, allowing API users to return everything at once can be a problem both for our server (lots of data in RAM when fetching from the DB => OOM, and additional stress on the DB) and for the user (the same problem on their side). Second, it's easy to forget to specify filters, especially in cases like "let's delete something based on some filters."

So the standard practice now is to return nothing if no filters are provided, and we pay attention to it during code reviews. If the user does really want all the data, you can add pagination to your API. With pagination, it's very unlikely for the user to accidentally fetch everything because they must explicitly work with pagination tokens, etc.

Another option, if you don't want pagination, is to have a separate method named accordingly, like ListAllObjects, without any filters.

alemanek3mo ago

Returning an empty result in that case may cause a more subtle failure. I would think returning an error would be a bit better as it would clearly communicate that the caller called the API endpoint incorrectly. If it’s HTTP a 400 Bad Request status code would seem appropriate.


Thaxll3mo ago

Neither of your options is good. The first question you need to ask is whether the filter is optional or not (this is a contract / API question).

If it's not optional then return 400, otherwise return all the results (and have pagination).

You should always have pagination in an API.
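To make the "required filter means 400, everything else is paginated" rule concrete, here is a minimal Go sketch. The `status` filter, the `limit` parameter, the page-size constants, and the helper names are all made up for illustration; nothing here comes from Cloudflare's API.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/url"
	"strconv"
)

// Hypothetical page-size bounds for the sketch.
const (
	defaultLimit = 50
	maxLimit     = 500
)

// listParams is the validated form of the query string.
type listParams struct {
	Status string // required filter (assumed name)
	Limit  int    // page size, always between 1 and maxLimit
}

var errMissingFilter = errors.New("missing required filter: status")

// parseListParams rejects a missing filter instead of silently
// treating it as "match everything" or "match nothing".
func parseListParams(q url.Values) (listParams, error) {
	status := q.Get("status")
	if status == "" {
		return listParams{}, errMissingFilter
	}
	limit := defaultLimit
	if raw := q.Get("limit"); raw != "" {
		n, err := strconv.Atoi(raw)
		if err != nil || n < 1 {
			return listParams{}, fmt.Errorf("invalid limit %q", raw)
		}
		if n > maxLimit {
			n = maxLimit // cap the page size rather than fail
		}
		limit = n
	}
	return listParams{Status: status, Limit: limit}, nil
}

// statusFor reports the HTTP status the hypothetical endpoint would send.
func statusFor(rawQuery string) int {
	q, err := url.ParseQuery(rawQuery)
	if err != nil {
		return http.StatusBadRequest
	}
	if _, err := parseListParams(q); err != nil {
		return http.StatusBadRequest
	}
	return http.StatusOK
}

func main() {
	for _, raw := range []string{"", "status=active", "status=active&limit=0"} {
		fmt.Printf("%q -> %d\n", raw, statusFor(raw))
	}
}
```

Whether an oversized `limit` should be capped or rejected is a contract decision; the sketch caps it only to keep the example short.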

Philip-J-Fry3mo ago

>allowing API users to return everything at once can be a problem both for our server (lots of data in RAM when fetching from the DB => OOM, and additional stress on the DB)

You can limit stress on RAM by streaming the data. You should ideally stream rows for any large dataset. Otherwise, like you say, you are loading the entire thing into RAM.
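A rough Go sketch of what streaming rows can look like: each row is encoded and written as soon as it is produced, so only one row is held in memory at a time. The `prefix` type, the `next` iterator, and the `render` helper are assumptions for the example; a real handler would pull rows from a database cursor and write to the HTTP response.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// prefix is a hypothetical row type.
type prefix struct {
	ID   int    `json:"id"`
	CIDR string `json:"cidr"`
}

// streamJSON writes a JSON array to w one row at a time.
// next returns false once the source is exhausted.
func streamJSON(w io.Writer, next func() (prefix, bool)) error {
	if _, err := io.WriteString(w, "["); err != nil {
		return err
	}
	enc := json.NewEncoder(w)
	first := true
	for {
		row, ok := next()
		if !ok {
			break
		}
		if !first {
			if _, err := io.WriteString(w, ","); err != nil {
				return err
			}
		}
		first = false
		// Encode writes the row followed by a newline.
		if err := enc.Encode(row); err != nil {
			return err
		}
	}
	_, err := io.WriteString(w, "]")
	return err
}

// render streams n generated rows into a string, standing in for a DB cursor.
func render(n int) string {
	var sb strings.Builder
	i := 0
	next := func() (prefix, bool) {
		if i >= n {
			return prefix{}, false
		}
		i++
		return prefix{ID: i, CIDR: fmt.Sprintf("10.0.%d.0/24", i)}, true
	}
	if err := streamJSON(&sb, next); err != nil {
		return ""
	}
	return sb.String()
}

func main() {
	fmt.Println(render(3))
}
```

With an `http.ResponseWriter` as the writer, the same loop could also flush periodically through `http.Flusher` so the client starts receiving rows before the query finishes.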

jiggawatts3mo ago

Not to mention the latency reduction!

Buffering up the entire data set before encoding it to JSON and sending it is one of the biggest sources of latency in API based software. Streaming can get latencies down to tens of microseconds!

qwertyuiop_3mo ago

How about returning an error? A 400 is the generic “client sent something wrong” bucket. Missing a required filter param is unambiguously a client mistake according to your own docs/contract → client error → 4xx family → 400 is the safest/default member of that family.

MobileVet3mo ago

I like your thought process around the ‘empty’ case. While the opposite of a filter is no filter, to your point, that is probably not really the desire when it comes to data retrieval. We might have to revisit that ourselves.

PunchyHamster3mo ago

But that query had a parameter. They just fucked up parsing it.

est3mo ago

> to have a separate method named accordingly, like ListAllObjects, without any filters

For me it's like `filter1=*`

CommonGuy3mo ago

Insufficient mock data in the staging environment? Like no BYOIP prefixes at all? Since even one prefix should have shown that it would be deleted by that subtask...

From all the recent outages, it sounds like Cloudflare is barely tested at all. Maybe they have lots of unit tests etc, but they do not seem to test their whole system... I get that their whole setup is vast, but even testing that subtask manually would have surfaced the bug

zmj3mo ago

Testing the "whole system" for a mature enterprise product is quite difficult. The combinatorial explosion of account configurations and feature usage becomes intractable on two levels: engineers can't anticipate every scenario they need their tests to cover (because the product is too big to understand the whole of), and even if comprehensive testing were possible, it would be impractical on some combination of time, flakiness, and cost.

dabinat3mo ago

I think Cloudflare does not sufficiently test lesser-used options. I lurk in the R2 Discord and a lot of users seem to have problems with custom domains.

asciii3mo ago

It was also merged 15 days prior to production release...however, you're spot on with the empty test. That's a basic scenario that if it returned all...is like oh no.

suhputt3mo ago

my guess is the company is rotting from the inside and drowning in tech debt

martinald3mo ago

Just crazy. Why does a staging environment matter? They should be running some integration tests against eg an in memory database for these kinds of tasks surely?

otar3mo ago

Reliability was/is CF's label.

It's alarming already. Too many outages in the past months. CF should fix it, or it becomes unacceptable and people will leave the platform.

I really hope they will figure things out.

tallytarik3mo ago

We’re still waiting on a solution for https://www.cloudflarestatus.com/incidents/391rky29892m (which actually started a month earlier than the incident reports)

In the meantime, as you say, we’re now going through and evaluating other vendors for each component that CF provides - which is both unfortunate, and a frustrating use of time, as CF’s services “just worked” very well for a very long time.

argestes3mo ago

I have many things dependent on Cloudflare. That makes me root for Cloudflare and I think I'm not the only one. Instead of finding better options we're getting stuck on an already failing HA solution. I wonder what caused this.

slothsarecool3mo ago

There are no alternatives, and those alternatives that did exist back in the day, had to shut down due to either going out of business or not being able to keep a paygo model.

Not everybody needs cloudflare, but those that need it and aren't major enterprises, have no other option.

Sanzig3mo ago

Bunny.net? Doesn't have near the same feature set as Cloudflare, but the essentials are there and you can easily pay as you go with a credit card.


pocksuppet3mo ago

Lots of people who think they need Cloudflare don't. What are you using it for?


arcatech3mo ago

Do you not feel concern about you and everybody else deciding to put ALL of their eggs into one basket like this?

ranger_danger3mo ago

I would bet money that most people who use CF now are already hosting their endpoints at a single provider. I don't think most people care until it actually becomes enough of a problem.

esseph3mo ago

Like AWS/GCP/Azure?

alansaber3mo ago

Not sure why everyone is complaining, new MCP features are more important than uptime

NinjaTrance3mo ago

The irony is that the outage was caused by a change from the "Code Orange: Fail Small initiative".

They definitely failed big this time.

vimda3mo ago

One has to wonder when the board realises Dane was a bad replacement for JGC. These outages are getting ridiculous

blibble3mo ago

is this blog post LLM generated?

the explanation makes no sense:

> Because the client is passing pending_delete with no value, the result of Query().Get(“pending_delete”) here will be an empty string (“”), so the API server interprets this as a request for all BYOIP prefixes instead of just those prefixes that were supposed to be removed. The system interpreted this as all returned prefixes being queued for deletion.

client:

     resp, err := d.doRequest(ctx, http.MethodGet, `/v1/prefixes?pending_delete`, nil)

server:

    if v := req.URL.Query().Get("pending_delete"); v != "" {
        // ignore other behavior and fetch pending objects from the ip_prefixes_deleted table
        prefixes, err := c.RO().IPPrefixes().FetchPrefixesPendingDeletion(ctx)
        if err != nil {
            api.RenderError(ctx, w, ErrInternalError)
            return
        }

        api.Render(ctx, w, http.StatusOK, renderIPPrefixAPIResponse(prefixes, nil))
        return
    }

even if the client had passed a value it would have still done exactly the same thing, as the value of "v" (or anything from the request) is not used in that block

PunchyHamster3mo ago

better explanation here https://news.ycombinator.com/item?id=47106852

but in short they are checking whether the string is empty, and the query string "pending_delete" is the same as "pending_delete=" and will return an empty string

Or, if they specified `/v1/prefixes?pending_delete=potato` it would return "correct" list of objects to delete

Or in other words "Go has type safety, fuck it, let's use strings like in '90s PHP apps instead"
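The behaviour described here is easy to check with the standard library alone. A small sketch (the `describe` helper is just for the demo) comparing what `Get` and `Has` report for each form of the query string; `Has` needs Go 1.17 or newer:

```go
package main

import (
	"fmt"
	"net/url"
)

// describe reports what url.Values sees for the key "pending_delete".
func describe(rawQuery string) (value string, present bool) {
	q, err := url.ParseQuery(rawQuery)
	if err != nil {
		return "", false
	}
	return q.Get("pending_delete"), q.Has("pending_delete")
}

func main() {
	queries := []string{"", "pending_delete", "pending_delete=", "pending_delete=potato"}
	for _, raw := range queries {
		v, ok := describe(raw)
		fmt.Printf("%-24q Get=%q Has=%t\n", raw, v, ok)
	}
}
```

`Get` cannot tell a missing key from a key with an empty value, which is why a `v != ""` check treats `?pending_delete` the same as no parameter at all. `Has` is the call that separates the two.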

lenkite3mo ago

If Go supported Optionals, this bug would not have surfaced.

PunchyHamster3mo ago

You can probably make it with generics.

But if I was given the option to steal one feature out of other languages it would be enums and the resulting Result/Option from Rust
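For what it's worth, a bare-bones Optional built on generics might look like the sketch below. All names (`Optional`, `Some`, `None`, `paramOf`) are invented for the example, and unlike Rust's enums nothing forces the caller to handle the empty case, so this is a convention rather than a guarantee.

```go
package main

import "fmt"

// Optional holds either a value of type T or nothing.
type Optional[T any] struct {
	value T
	ok    bool
}

// Some wraps a present value.
func Some[T any](v T) Optional[T] { return Optional[T]{value: v, ok: true} }

// None is the absent case.
func None[T any]() Optional[T] { return Optional[T]{} }

// Get returns the value and whether it was present.
func (o Optional[T]) Get() (T, bool) { return o.value, o.ok }

// OrElse returns the value, or fallback when absent.
func (o Optional[T]) OrElse(fallback T) T {
	if o.ok {
		return o.value
	}
	return fallback
}

// paramOf is a hypothetical lookup that keeps "absent" and
// "present but empty" as two different results.
func paramOf(params map[string]string, key string) Optional[string] {
	if v, found := params[key]; found {
		return Some(v)
	}
	return None[string]()
}

func main() {
	params := map[string]string{"pending_delete": ""}
	_, present := paramOf(params, "pending_delete").Get()
	fmt.Println("pending_delete present:", present)
	fmt.Println("missing key fallback:", paramOf(params, "other").OrElse("n/a"))
}
```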

bstsb3mo ago

doesn't look AI-generated. even if they have made a mistake, it's probably just from the rush of getting a postmortem out prior to root cause analysis

bretthoerner3mo ago

> even if the client had passed a value it would have still done exactly the same thing, as the value of "v" (or anything from the request) is not used in that block

If they passed in any value, they would have entered the block and returned early with the results of FetchPrefixesPendingDeletion.

From the post:

> this was implemented as part of a regularly running sub-task that checks for BYOIP prefixes that should be removed, and then removes them.

They expected to drop into the block of code above, but since they didn't, they returned all routes.

blibble3mo ago

okay so the code which returned everything isn't there

actual explanation: the API server by default returns everything. the client attempted to make a request to return "pending_deletes", but as the request was malformed, the API instead went down the default path, which returned everything. then the client deleted everything.

makes sense now

but that explanation is even worse

because that means the code path was never tested?

jbxntuehineoh3mo ago

or they tested it, but not with a dataset that contained prefixes not pending deletion

asuffield2mo ago

Hi! I wrote this paragraph. I promise that I'm not an LLM, but I was in about hour 10 of my work day and I was asleep not long after writing this. Any failures in comprehensibility are from exhaustion.

(Other comments have explained the bug so I won't repeat them)

subscribed3mo ago

That's weird. They only removed some 6 of our prefixes out of perhaps 40 we have with them, so something seems off in this explanation.

himata41133mo ago

yep, no mention that re-advertised prefixes would be withdrawn again as well during the entire impact even after they shut it down.

atty3mo ago

I do not work in the space at all, but it seems like Cloudflare has been having more network disruptions lately than they used to. To anyone who deals with this sort of thing, is that just recency bias?

Icathian3mo ago

It is not. They went about 5 years without one of these, and had a handful over the last 6 months. They're really going to need to figure out what's going wrong and clean up shop.

NinjaTrance3mo ago

Engineers have been vibe coding a lot recently...

jsheard3mo ago

The featured blog post where one of their senior engineering PMs presented an allegedly "production grade" Matrix implementation, in which authentication was stubbed out as a TODO, says it all really. I'm glad a quarter of the internet is in such responsible hands.


dakiol3mo ago

No joke. In my company we "sabotaged" the AI initiative led by the CTO. We used LLMs to deliver features as requested by the CTO, but we introduced a couple of bugs here and there (intentionally). As a result, the quarter ended up with more time allocated to fix bugs and tons of customer claims. The CTO is now undoing his initiative. We all now have some more time to keep our jobs.


Ylpertnodi3mo ago

Typo: "shop", should have been with an 'el'.

(: phonetically, because 'l's are hard to read.

lysace3mo ago

It has been roughly speaking five and a half years since the IPO. The original CTO (John Graham-Cumming) left about a year ago.

jacquesm3mo ago

They coasted on momentum for half a year. I don't even think it says anything negative about the current CTO, but more of what an exception JGC is relative to what is normal. A CTO leaving would never show up the next day in the stats, the position is strategic after all. But you'd expect to see the effect after a while, 6 months is longer than I would have expected, but short enough that cause and effect are undeniable.

Even so, it is a strong reminder not to rely on any one vendor for critical stuff, in case that wasn't clear enough yet.

lysace3mo ago

You can coast for quite some time (5-10 years?) if you really lean into it (95% of the knowledge of maintaining and scaling the stack is there in the minds of hundreds of developers).

Seems like Matthew Prince didn't choose that route.


dazc3mo ago

I wondered what happened to him?

jgrahamc3mo ago

I am reading HN.


brcmthrowaway3mo ago

He's on a yacht somewhere


Cipater3mo ago

He's still a member of the board though.

dazc3mo ago

Launching a new service every 5 minutes is obviously stretching their resources.

Betelbuddy3mo ago

Cloudflare outages are as predictable as the Sun coming up tomorrow. It's their engineering culture.

https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...

candiddevmike3mo ago

Wait till you see the drama around their horrible terraform provider update/rewrite:

https://github.com/cloudflare/terraform-provider-cloudflare/...

slophater3mo ago

been at cf for 7 yrs but thinking of gtfo soon. the ceo is a manchild, new cto is an idiot, rest of leadership was replaced by yes-men, and the push for AI-first is being a disaster. c levels pretend they care about reliability but pressure teams to constantly ship, cto vibe codes terraform changes without warning anyone, and it's overall a bigger and bigger mess

even the blog, that used to be a respected source of technical content, has morphed into a garbage fire of slop and vaporware announcements since jgc left.

sebmellen3mo ago

Do you feel that Matthew Prince is still technically active/informed? I've interacted with him in the past and he seemed relatively technically grounded, but that doesn't seem as true these days.

3rodents3mo ago

https://xcancel.com/eastdakota/status/2025215495142564177

https://xcancel.com/eastdakota/status/2025221270061580453

Rather than be driven by something rational like building a great product or making lots of money he is apparently driven by a desperate fear of being a dinosaur.

Regardless of how competent he is or isn’t as a technologist, a leader leading with fear is a recipe for disaster.


goalieca3mo ago

I’ve had a lot of problems lately. Basic things are failing and it’s like product isn’t involved at all in the dash. What’s worse? The support.. the chat is the buggiest thing I’ve ever seen.

slophater3mo ago

don't worry, if it gets much worse the ceo will just throw all of support under the bus again. it will surely get better.


__turbobrew__3mo ago

You know what they say, shit rolls downhill. I don't personally know the CEO, but the feeling I have got from their public fits on social media doesn't instill confidence.

If I was a CF customer I would be migrating off now.

nanankcornering3mo ago

do you care that much about leadership when using a service? even I dont know gcp's c-level, aws's c-level, even vercel's c-level. only know rauchg.

i think i care much more about our SLAs (if any)

a24446ff873mo ago

GSD! GSD!! ship! ship! ship!

**everything breaks**

...

**everything breaks again**

oh fuck! Code Orange! I repeat, Code Orange! we need to rebuild trust(R)(TM)! we've let our customers down!

...

**everything breaks again**

Code Orangier! I repeat, Code Orangier!

slophater3mo ago

exactly. recently "if the cto is shipping more than you, you're doing something wrong"

cto can't even articulate a sentence without passing it through an LLM, and instead of doing his job he's posting the stupidest shit to his personal bootlicking chat channel. I cringe every time at the brown-nosers that inhabit that hovel.

no words for what the product org is becoming too. they should take their own advice a bit further and just replace all the leadership with an LLM, it would be cheaper and it's the same shit in practice


slophater3mo ago

amazing how my comment was flagged in 30 seconds... keep bootlicking

anurag3mo ago

The one redeeming feature of this failure is staged rollouts. As someone advertising routes through CF, we were quite happy to be spared from the initial 25%.

jaboostin3mo ago

Hindsight is 20/20 but why not dry run this change in production and monitor the logs/metrics before enabling it? Seems prudent for any new “delete something in prod” change.
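As a sketch of what a dry run for a destructive task could look like in Go: the task builds its plan either way, but only calls the delete function when `dryRun` is off. The `removePending` function and the `deleter` type are hypothetical; the post does not show how the real sub-task is structured.

```go
package main

import "fmt"

// deleter performs the real, destructive call for one prefix.
type deleter func(prefix string) error

// removePending returns the list of prefixes it would delete and the number
// it actually deleted. With dryRun set, del is never called.
func removePending(pending []string, dryRun bool, del deleter) ([]string, int, error) {
	planned := make([]string, 0, len(pending))
	deleted := 0
	for _, p := range pending {
		planned = append(planned, p)
		if dryRun {
			continue // record the plan only
		}
		if err := del(p); err != nil {
			return planned, deleted, fmt.Errorf("delete %s: %w", p, err)
		}
		deleted++
	}
	return planned, deleted, nil
}

func main() {
	pending := []string{"192.0.2.0/24", "198.51.100.0/24"}
	noop := func(string) error { return nil }

	planned, deleted, err := removePending(pending, true, noop)
	fmt.Println("dry run:", planned, deleted, err)
}
```

Logging the size of `planned` in dry-run mode is the kind of signal that would stand out in metrics if the plan suddenly covered every prefix.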

Bender3mo ago

Old tech could work around these outages. Set up GSLB at a DNS provider that does health checks, or perform your own health checks against both the origin and the CDNs and use APIs to change DNS. If the origin servers are OK and the CDN is not, automatically change DNS to a different CDN. There should be multiple probes that form a consensus. This process assumes one is managing the configurations of their CDNs through code and API so that one can set up and tear down any number of CDNs on a whim.

That does mean having contracts with more than one CDN provider; however, the cost should be negotiated based on monthly volume, i.e. the CDN with the most uptime gets the most money. If an existing CDN under contract refuses to negotiate, then move some non-critical-path services to them and let that contract expire. Instate a company-wide policy to never return to a vendor if their contract was intentionally not renewed.
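The consensus rule can be sketched in a few lines of Go. Everything here is an assumption for illustration (the `probeResult` type, a simple majority as the consensus rule); the actual probing and the DNS API call are left out.

```go
package main

import "fmt"

// probeResult is what one probe location observed.
type probeResult struct {
	OriginOK bool // origin answered its health check
	CDNOK    bool // the same check through the CDN succeeded
}

// majority reports whether more than half of the probes satisfy pick.
func majority(results []probeResult, pick func(probeResult) bool) bool {
	n := 0
	for _, r := range results {
		if pick(r) {
			n++
		}
	}
	return n*2 > len(results)
}

// shouldFailOver is true only when most probes agree that the origin is
// healthy and most probes agree that the CDN is not.
func shouldFailOver(results []probeResult) bool {
	if len(results) == 0 {
		return false // no data, no change
	}
	originUp := majority(results, func(r probeResult) bool { return r.OriginOK })
	cdnDown := majority(results, func(r probeResult) bool { return !r.CDNOK })
	return originUp && cdnDown
}

func main() {
	results := []probeResult{
		{OriginOK: true, CDNOK: false},
		{OriginOK: true, CDNOK: false},
		{OriginOK: true, CDNOK: true},
	}
	fmt.Println("fail over:", shouldFailOver(results))
}
```

Requiring the origin to look healthy before switching avoids moving DNS around when the real problem is behind the CDN.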

himata41133mo ago

This blog post is inaccurate, the prefixes were being revoked over and over - to keep your prefixes advertised you had to have a script that would readd them or else it would be withdrawn again. The way they seemed to word it is really dishonest.

dilyevsky3mo ago

Lmao, iirc long time ago Google's internal system had the same exact bug (treating empty as "all" in the delete call) that took down all their edges. Surprisingly there was little impact as traffic just routed through the next set of proxies.

boarush3mo ago

While neither I nor the company I work for is directly impacted by this outage, I wonder how long Cloudflare can take these hits and keep apologizing for it. Truly appreciate them being transparent about it, but businesses care more about SLAs and uptime than the incident report.

llama0523mo ago

I’ll take clarity and actual RCAs over Microsoft’s approach of not notifying customers and keeping their status page green until enough people notice.

One thing I do appreciate about cloudflare is their actual use of their status page. That’s not to say these outages are okay. They aren’t. However I’m pretty confident in saying that a lot of providers would have a big paper trail of outages if they were more honest to the same degree or more so than cloudflare. At least from what I’ve noticed, especially this year.

boarush3mo ago

Azure straight up refuses to show me if there's even an incident even if I can literally not access shit.

But the last few months have been quite rough for Cloudflare, with a few outages on their Workers platform that didn't quite make the headlines too. Can't wait for Code Orange to get to production.

jacquesm3mo ago

Bluntly: they expended that credit a while ago. Those that can will move on. Those that can't have a real problem.

As for your last sentence:

Businesses really do care about the incident reports because they give good insight into whether they can trust the company going forward. Full transparency and a clear path to non-repetition due to process or software changes are called for. You be the judge of whether or not you think that standard has been met.

boarush3mo ago

I might be looking at it differently, but aren't decisions about a service provider made by management? Incident reports don't ever reach there in my experience.

jacquesm3mo ago

Every company that relies on their suppliers and that has mature management maintains internal supplier score cards as part of their risk assessment, more so for suppliers that are hard to find replacements for. They will of course all have their own thresholds for action but what has happened in the last period with CF exceeds most of the thresholds for management comfort that I'm aware of.

Incident reports themselves are highly technical, so will not reach management because they are most likely simply not equipped to deal with them. But the CTOs of the companies will take notice, especially when their own committed SLAs are endangered and their own management asks them for an explanation. CF makes them all look bad right now.

samrus3mo ago

In my experience, the gist of it does reach management when it's an existing vendor. Especially if management is tech literate

Because management wants to know why the graphs all went to zero, and the engineers have nothing else to do but relay the incident report.

This builds a perception for management of the vendor, and if the perception is that the vendor doesn't tell them shit or doesn't even seem to know there's an outage, then management can decide to shift vendors

VirusNewbie3mo ago

If you track large SaaS and Cloud uptime, it seems to correlate pretty highly with compensation for big companies. Is cloudflare getting top talent?

bombcar3mo ago

Based on IPO date and lockups, I suspect top talent is moving on.

abalone3mo ago

The code they posted doesn't quite explain the root cause. This is a good case study for resilient API design and testing.

They said their /v1/prefixes endpoint has this snippet:

  if v := req.URL.Query().Get("pending_delete"); v != "" {
      // ignore other behavior and fetch pending objects from the ip_prefixes_deleted table
      prefixes, err := c.RO().IPPrefixes().FetchPrefixesPendingDeletion(ctx)
      
      [..snip..]
  }

What's implied but not shown here is that endpoint normally returns all prefixes. They modified it to return just those pending deletion when passing a pending_delete query string parameter.

The immediate problem of course is this block will never execute if pending_delete has no value:

  /v1/prefixes?pending_delete   <-- doesn't execute block

This is because Go defaults query params to empty strings and the if statement skips this case. Which makes you wonder, what is the value supposed to be? This is not explained. If it's supposed to be:

  /v1/prefixes?pending_delete=true   <--- executes block

Then this would work, but the implementation fails to validate this value. From this you can infer that no unit test was written to exercise the value:

  /v1/prefixes?pending_delete=false   <-- wrongly executes block

The post explains "initial testing and code review focused on the BYOIP self-service API journey." We can reasonably guess their tests were passing some kind of "true" value for the param, either explicitly or using a client that defaulted param values. What they didn't test was how their new service actually called it.

So, while there's plenty to criticize on the testing front, that's first and foremost a basic failure to clearly define an API contract and implement unit tests for it.
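One possible shape for a stricter contract, sketched in Go under the assumption that `pending_delete` is meant to be a boolean (the post never says what the value should be). `parsePendingDelete` and `check` are made-up names; `url.Values.Has` and `strconv.ParseBool` are standard library calls.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// parsePendingDelete returns whether the caller asked for pending deletions.
// An absent parameter means "no"; a present but invalid one is an error
// that the handler would turn into a 400.
func parsePendingDelete(q url.Values) (bool, error) {
	if !q.Has("pending_delete") {
		return false, nil
	}
	raw := q.Get("pending_delete")
	v, err := strconv.ParseBool(raw)
	if err != nil {
		return false, fmt.Errorf("pending_delete must be a boolean, got %q", raw)
	}
	return v, nil
}

// check parses a raw query string and applies parsePendingDelete.
func check(rawQuery string) (bool, error) {
	q, err := url.ParseQuery(rawQuery)
	if err != nil {
		return false, err
	}
	return parsePendingDelete(q)
}

func main() {
	cases := []string{
		"",
		"pending_delete",
		"pending_delete=true",
		"pending_delete=false",
		"pending_delete=potato",
	}
	for _, raw := range cases {
		v, err := check(raw)
		fmt.Printf("%-24q pending=%t err=%v\n", raw, v, err)
	}
}
```

The list of cases in `main` doubles as the table a unit test for this contract would walk through, including the bare `?pending_delete` form the client actually sent.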

But there's a third problem, in my view the biggest one, at the design level. For a critical delete path they chose to overload an existing endpoint that defaults to returning everything. This was a dangerous move. When high stakes data loss bugs are a potential outcome, it's worth considering a more restrictive API that is harder to use incorrectly. If they had implemented a dedicated endpoint for pending deletes they would have likely omitted this default behavior meant for non-destructive read paths.

In my experience, these sorts of decisions can stem from team ownership differences. If you owned the prefixes service and were writing an automated agent that could blow away everything, you might write a dedicated endpoint for it. But if you submitted a request to a separate team to enhance their service to return a subset of X, without explaining the context or use case very much, they may be more inclined to modify the existing endpoint for getting X. The lack of context and communication can end up missing the risks involved.

Final note: It's a little odd that the implementation uses Go's "if with short statement" syntax when v is only ever used once. This isn't wrong per se but it's strange and makes me wonder to what extent an LLM was involved.

PunchyHamster3mo ago

> But there's a third problem, in my view the biggest one, at the design level. For a critical delete path they chose to overload an existing endpoint that defaults to returning everything. This was a dangerous move. When high stakes data loss bugs are a potential outcome, it's worth considering a more restrictive API that is harder to use incorrectly. If they had implemented a dedicated endpoint for pending deletes they would have likely omitted this default behavior meant for non-destructive read paths.

Or a POST endpoint, with the client side just sending a serialized object as the query rather than relying on the developer remembering the magical query string.

fjoaasdfas3mo ago

yikes: https://github.com/golang/go/blob/master/src/net/url/url.go#...

maybe go can do (string v, ok bool) for this or add proper sum types...

ssiddharth3mo ago

The eternal tech outage aphorism: It's always DNS, except for when it's BGP.

subscribed3mo ago

You could argue BGP is like DNS for IPs :)

est3mo ago

bitbucket was down for a while as well. Seems no one noticed.

wa0083mo ago

This transparent report can earn my trust

NooneAtAll33mo ago

again?

dryarzeg3mo ago

DaaS - Downtime as a Service©

Just joking, no offence :)

logicchains3mo ago

DaaS is good ja

henning3mo ago

[flagged]

sp00chy3mo ago

that’s my feeling also. We will get this more and more in future.

djfobbz3mo ago

I'm honestly amazed that a company CF's size doesn't have a neat little cluster of Mac Minis running OpenClaw and quietly taking care of this for them.

user2057383mo ago

They should have rewritten this code in Rust using these brilliant language models. /jk

j / k navigate · click thread line to collapse

125 comments

kgeist3mo ago

It's something we debated in our team: if there's an API that returns data based on filters, what's the better behavior if no filters are provided - return everything or return nothing?

Another option, if you don't want pagination, is to have a separate method named accordingly, like ListAllObjects, without any filters.

alemanek3mo ago

1 more reply

Thaxll3mo ago

Neither of your options are good, the first question you need to ask is that is the filter optional or not ( this is a contract / API question ).

If not optional then return 400, otherwise return all the results ( and have pagination ).

You should always have pagination in an API.

Philip-J-Fry3mo ago

>allowing API users to return everything at once can be a problem both for our server (lots of data in RAM when fetching from the DB => OOM, and additional stress on the DB)

You can limit stress on RAM by streaming the data. You should ideally stream rows for any large dataset. Otherwise, like you say you are loading the entire thing into RAM.

jiggawatts3mo ago

Not to mention the latency reduction!

Buffering up the entire data set before encoding it to JSON and sending it is one of the biggest sources of latency in API based software. Streaming can get latencies down to tens of microseconds!

qwertyuiop_3mo ago

MobileVet3mo ago

PunchyHamster3mo ago

But that query had parameter. They just fucked up parsing it

est3mo ago

> to have a separate method named accordingly, like ListAllObjects, without any filters

For me it's like `filter1=*`

CommonGuy3mo ago

Insufficient mock data in the staging environment? Like no BYOIP prefixes at all? Since even one prefix should have shown that it would be deleted by that subtask...

zmj3mo ago

dabinat3mo ago

I think Cloudflare does not sufficiently test lesser-used options. I lurk in the R2 Discord and a lot of users seem to have problems with custom domains.

asciii3mo ago

It was also merged 15 days prior to production release...however, you're spot on with the empty test. That's a basic scenario that if it returned all...is like oh no.

suhputt3mo ago

my guess is the company is rotting from the inside and drowning in tech debt

martinald3mo ago

Just crazy. Why does a staging environment matter? They should be running some integration tests against eg an in memory database for these kinds of tasks surely?

otar3mo ago

Reliability was/is CF's label.

It's alarming already. Too many outages in the past months. CF should fix it, or it becomes unacceptable and people will leave the platform.

I really hope they will figure things out.

tallytarik3mo ago

We’re still waiting on a solution for https://www.cloudflarestatus.com/incidents/391rky29892m (which actually started a month earlier than the incident reports)

argestes3mo ago

slothsarecool3mo ago

There are no alternatives, and those alternatives that did exist back in the day, had to shut down due to either going out of business or not being able to keep a paygo model.

Not everybody needs cloudflare, but those that need it and aren't major enterprises, have no other option.

Sanzig3mo ago

Bunny.net? Doesn't have near the same feature set as Cloudflare, but the essentials are there and you can easily pay as you go with a credit card.

1 more reply

pocksuppet3mo ago

Lots of people who think they need Cloudflare don't. What are you using it for?

1 more reply

arcatech3mo ago

Do you not feel concern about you and everybody else deciding to put ALL of their eggs into one basket like this?

ranger_danger3mo ago

I would bet money that most people who use CF now are already hosting their endpoints at a single provider. I don't think most people care until it actually becomes enough of a problem.

esseph3mo ago

Like AWS/GCP/Azure?

alansaber3mo ago

Not sure why everyone is complaining, new MCP features are more important than uptime

NinjaTrance3mo ago

The irony is that the outage was caused by a change from the "Code Orange: Fail Small initiative".

They definitely failed big this time.

vimda3mo ago

One has to wonder when the board realises Dane was a bad replacement for JGC. These outages are getting ridiculous

blibble3mo ago

is this blog post LLM generated?

the explanation makes no sense:

client:

     resp, err := d.doRequest(ctx, http.MethodGet, `/v1/prefixes?pending_delete`, nil)

server:

    if v := req.URL.Query().Get("pending_delete"); v != "" {
        // ignore other behavior and fetch pending objects from the ip_prefixes_deleted table
        prefixes, err := c.RO().IPPrefixes().FetchPrefixesPendingDeletion(ctx)
        if err != nil {
            api.RenderError(ctx, w, ErrInternalError)
            return
        }

        api.Render(ctx, w, http.StatusOK, renderIPPrefixAPIResponse(prefixes, nil))
        return
    }

even if the client had passed a value it would have still done exactly the same thing, as the value of "v" (or anything from the request) is not used in that block

PunchyHamster3mo ago

better explanation here https://news.ycombinator.com/item?id=47106852

but in short they are checking whether the string is empty, and the query string "pending_delete" is the same as "pending_delete=", so Get returns an empty string and the block is skipped

Or, if they had specified `/v1/prefixes?pending_delete=potato` it would have returned the "correct" list of objects to delete

Or in other words "Go has type safety, fuck it, let's use strings like in '90s PHP apps instead"
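
A minimal sketch of the difference described above, using only the standard net/url package. The helper name compareChecks is made up for illustration; url.Values.Has requires Go 1.17 or later. It shows that a bare key parses as a key with an empty value, so a `Get(...) != ""` check cannot tell it apart from a missing key, while Has can.

```go
package main

import (
	"fmt"
	"net/url"
)

// compareChecks is a hypothetical helper. It parses a raw query string and
// reports what two different presence checks say about "pending_delete":
// viaGet mirrors the `Get(...) != ""` check from the post, viaHas asks
// whether the key is present at all.
func compareChecks(rawQuery string) (viaGet, viaHas bool, err error) {
	q, err := url.ParseQuery(rawQuery)
	if err != nil {
		return false, false, err
	}
	viaGet = q.Get("pending_delete") != ""
	viaHas = q.Has("pending_delete")
	return viaGet, viaHas, nil
}

func main() {
	queries := []string{"pending_delete", "pending_delete=", "pending_delete=potato", ""}
	for _, raw := range queries {
		viaGet, viaHas, err := compareChecks(raw)
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Printf("%q: Get != \"\" is %t, Has is %t\n", raw, viaGet, viaHas)
	}
}
```

Whether presence alone should count as "true" is still a contract decision; Has only makes the distinction visible.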

lenkite3mo ago

If Go supported Optionals, this bug would not have surfaced.

PunchyHamster3mo ago

You can probably make it with generics.

But if I were given the option to steal one feature from other languages, it would be Rust's enums and the Result/Optional types built on them

bstsb3mo ago

doesn't look AI-generated. even if they have made a mistake, it's probably just from the rush of getting a postmortem out prior to root cause analysis

bretthoerner3mo ago

> even if the client had passed a value it would have still done exactly the same thing, as the value of "v" (or anything from the request) is not used in that block

If they passed in any value, they would have entered the block and returned early with the results of FetchPrefixesPendingDeletion.

From the post:

> this was implemented as part of a regularly running sub-task that checks for BYOIP prefixes that should be removed, and then removes them.

They expected to drop into the block of code above, but since they didn't, they returned all routes.

blibble3mo ago

okay so the code which returned everything isn't there

makes sense now

but that explanation is even worse

because that means the code path was never tested?

jbxntuehineoh3mo ago

or they tested it, but not with a dataset that contained prefixes not pending deletion

asuffield2mo ago

(Other comments have explained the bug so I won't repeat them)

subscribed3mo ago

That's weird. They only removed some 6 of our prefixes out of perhaps 40 we have with them, so something seems off in this explanation.

himata41133mo ago

yep, no mention that re-advertised prefixes would be withdrawn again as well during the entire impact even after they shut it down.

Icathian3mo ago

It is not. They went about 5 years without one of these, and had a handful over the last 6 months. They're really going to need to figure out what's going wrong and clean up shop.

NinjaTrance3mo ago

Engineers have been vibe coding a lot recently...

Ylpertnodi3mo ago

Typo: "shop", should have been with an 'el'.

(: phonetically, because 'l's are hard to read.

lysace3mo ago

It has been roughly speaking five and a half years since the IPO. The original CTO (John Graham-Cumming) left about a year ago.

jacquesm3mo ago

Even so, it is a strong reminder not to rely on any one vendor for critical stuff, in case that wasn't clear enough yet.

lysace3mo ago

You can coast for quite some time (5-10 years?) if you really lean into it (95% of the knowledge of maintaining and scaling the stack is there in the minds of hundreds of developers).

Seems like Matthew Prince didn't choose that route.

dazc3mo ago

I wondered what happened to him?

jgrahamc3mo ago

I am reading HN.

brcmthrowaway3mo ago

He's on a yacht somewhere

Cipater3mo ago

He's still a member of the board though.

dazc3mo ago

Launching a new service every 5 minutes is obviously stretching their resources.

Betelbuddy3mo ago

Cloudflare outages are as predictable as the Sun coming up tomorrow. It's their engineering culture.

https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...

candiddevmike3mo ago

Wait till you see the drama around their horrible terraform provider update/rewrite:

https://github.com/cloudflare/terraform-provider-cloudflare/...

slophater3mo ago

even the blog, that used to be a respected source of technical content, has morphed into a garbage fire of slop and vaporware announcements since jgc left.

sebmellen3mo ago

Do you feel that Matthew Prince is still technically active/informed? I've interacted with him in the past and he seemed relatively technically grounded, but that doesn't seem as true these days.

3rodents3mo ago

https://xcancel.com/eastdakota/status/2025215495142564177

https://xcancel.com/eastdakota/status/2025221270061580453

Rather than be driven by something rational like building a great product or making lots of money he is apparently driven by a desperate fear of being a dinosaur.

Regardless of how competent he is or isn’t as a technologist, a leader leading with fear is a recipe for disaster.

slophater3mo ago

don't worry, if it gets much worse the ceo will just throw all of support under the bus again. it will surely get better.

__turbobrew__3mo ago

You know what they say, shit rolls downhill. I don't personally know the CEO, but the feeling I have got from their public fits on social media doesn't instill confidence.

If I was a CF customer I would be migrating off now.

nanankcornering3mo ago

do you care that much about leadership when using a service? i don't even know gcp's c-level, aws's c-level, or vercel's c-level. only know rauchg.

i think i care much more about our SLAs (if any)

a24446ff873mo ago

GSD! GSD!! ship! ship! ship!

**everything breaks**

...

**everything breaks again**

oh fuck! Code Orange! I repeat, Code Orange! we need to rebuild trust(R)(TM)! we've let our customers down!

...

**everything breaks again**

Code Orangier! I repeat, Code Orangier!

slophater3mo ago

exactly. recently "if the cto is shipping more than you, you're doing something wrong"

slophater3mo ago

amazing how my comment was flagged in 30 seconds... keep bootlicking

anurag3mo ago

The one redeeming feature of this failure is staged rollouts. As someone advertising routes through CF, we were quite happy to be spared from the initial 25%.

jaboostin3mo ago

Hindsight is 20/20 but why not dry run this change in production and monitor the logs/metrics before enabling it? Seems prudent for any new “delete something in prod” change.
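
A rough sketch of what such a dry-run switch could look like. This is purely illustrative: removePrefixes, its parameters, and the prefixes (RFC 5737 documentation ranges) are all made up, and nothing here reflects how Cloudflare's sub-task is actually structured.

```go
package main

import (
	"fmt"
	"log"
)

// removePrefixes is a hypothetical cleanup step. In dry-run mode it only
// logs each prefix it would remove and never calls remove. It returns how
// many prefixes were (or would have been) removed.
func removePrefixes(prefixes []string, dryRun bool, remove func(string) error) (int, error) {
	n := 0
	for _, p := range prefixes {
		if dryRun {
			log.Printf("dry run: would remove %s", p)
			n++
			continue
		}
		if err := remove(p); err != nil {
			return n, fmt.Errorf("remove %s: %w", p, err)
		}
		n++
	}
	return n, nil
}

func main() {
	prefixes := []string{"192.0.2.0/24", "198.51.100.0/24"}
	n, err := removePrefixes(prefixes, true, func(string) error {
		panic("remove must not be called in dry-run mode")
	})
	fmt.Println("would remove:", n, "error:", err)
}
```

The logged count is the thing to watch: a job that suddenly "would remove" every prefix is easy to spot before the switch is flipped.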

llama0523mo ago

I’ll take clarity and actual RCAs over Microsoft’s approach of not notifying customers and keeping their status page green until enough people notice.

boarush3mo ago

Azure straight up refuses to show me if there's even an incident even if I can literally not access shit.

But the last few months have been quite rough for Cloudflare, including a few outages on their Workers platform that didn't quite make the headlines. Can't wait for Code Orange to get to production.

jacquesm3mo ago

Bluntly: they expended that credit a while ago. Those that can will move on. Those that can't have a real problem.

As for your last sentence:

boarush3mo ago

I might be looking at it differently, but aren't decisions about which service provider to use made by management? Incident reports don't ever reach there in my experience.

samrus3mo ago

In my experience, the gist of it does reach management when it's an existing vendor. Especially if management is tech literate.

Because management wants to know why the graphs all went to zero, and the engineers have nothing else to do but relay the incident report.

VirusNewbie3mo ago

If you track large SaaS and Cloud uptime, it seems to correlate pretty highly with compensation for big companies. Is Cloudflare getting top talent?

bombcar3mo ago

Based on IPO date and lockups, I suspect top talent is moving on.

abalone3mo ago

The code they posted doesn't quite explain the root cause. This is a good case study for resilient API design and testing.

They said their /v1/prefixes endpoint has this snippet:

  if v := req.URL.Query().Get("pending_delete"); v != "" {
      // ignore other behavior and fetch pending objects from the ip_prefixes_deleted table
      prefixes, err := c.RO().IPPrefixes().FetchPrefixesPendingDeletion(ctx)
      
      [..snip..]
  }

What's implied but not shown here is that endpoint normally returns all prefixes. They modified it to return just those pending deletion when passing a pending_delete query string parameter.

The immediate problem of course is this block will never execute if pending_delete has no value:

  /v1/prefixes?pending_delete   <-- doesn't execute block

  /v1/prefixes?pending_delete=true   <--- executes block

Then this would work, but the implementation fails to validate this value. From this you can infer that no unit test was written to exercise the value:

  /v1/prefixes?pending_delete=false   <-- wrongly executes block

So, while there's plenty to criticize on the testing front, that's first and foremost a basic failure to clearly define an API contract and implement unit tests for it.

PunchyHamster3mo ago

Or a POST endpoint, with the client side just sending a serialized object as the query rather than relying on the developer to remember the magical query string.

fjoaasdfas3mo ago

yikes: https://github.com/golang/go/blob/master/src/net/url/url.go#...

maybe go can do (v string, ok bool) for this or add proper sum types...

ssiddharth3mo ago

The eternal tech outage aphorism: It's always DNS, except for when it's BGP.

subscribed3mo ago

You could argue BGP is like DNS for IPs :)

est3mo ago

bitbucket was down for a while as well. Seems no one noticed.

wa0083mo ago

This transparent report can earn my trust

NooneAtAll33mo ago

again?

dryarzeg3mo ago

Just joking, no offence :)

logicchains3mo ago

DaaS is good ja

henning3mo ago

[flagged]

sp00chy3mo ago

that’s my feeling also. We will get this more and more in future.

djfobbz3mo ago

I'm honestly amazed that a company CF's size doesn't have a neat little cluster of Mac Minis running OpenClaw and quietly taking care of this for them.

user2057383mo ago

They should have rewritten this code in Rust using these brilliant language models. /jk
