I feel they focus a lot on their content validator lacking a check to catch this specific error (probably since that sounds like a more understandable oversight) when the more glaring issue is that they didn't try actually running this template instance on even a single machine, which would've instantly revealed the issue.
Even for amateur software with no unit/integration tests, the developer will typically still have run it on their own machine to see it working. Here CrowdStrike seem to have been flying blind, just praying new template instances work if they pass the validation checks.
They do at least promise to "ensure that every new Template Instance is tested" further down.
This is covered in part by a staged deployment... but that's just having your users test for you. Where's the automated integration test, or just the boot test?
Everyone else sees these services as the patsy when the problem happens.
From a technical perspective it's a hot mess (you are spot on). But business says "everything is fine, this is fine, carry on", because it meets their goal of CYA.
At no point did they deploy this file to a computer they owned and attempt to boot it. They purposely decided to deploy behavior to every computer they could without even once making sure it wouldn't break from something stupid.
Are these people fucking nuts?
I do more testing than this and I might be incompetent. Also nothing I touch will kill millions of PCs. I get having pressure put on you from above, I get being encouraged to cut corners so some shithead can check off a box on his yearly review and make more money while stiffing you on your raise, I get making mistakes.
But like, fuck man, come on.
I've made changes on personal projects that I thought were simple, and yet broke stuff. But CrowdStrike is a multi-billion dollar company -- how can it be possible to have such a broken process? Their RCA document was interesting, but didn't cover any of the interesting issues. It seems that they don't know about the 5 Whys process (https://en.wikipedia.org/wiki/Five_whys) or decided that those answers were so embarrassing that they had to omit them.
It's not uncommon for devs to be working against outdated databases / config dumps. Certainly bad practice, but when devs have the option of being lazy vs doing chores, they will pick the path of least resistance.
> But what I can't understand is that the human who initiated this change decided not to see if it actually did what they wanted it to do.
We're assuming that the person who changed the code also made the choice to initiate the rollout. They are 2 separate actions which can be made by separate individuals, and could also involve many steps in between, each undertaken by a different individual as well.
Distance from Prod does introduce a sense of malaise and complacency, I've found.
Team 1 tells Team 2 that the schema is updating.
Team 2 updates their schema.
Team 2 tests against updated schema.
All green in test.
Team 1 doesn't actually follow the schema.
Deployment fails.
---
It's really hard to assign blame, but I'd put more blame on Team 2 for not being defensive enough with their inputs.
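To make that concrete, here's a rough sketch of what "defensive with inputs" could look like on the consumer side (the names and the 21-field count here are purely illustrative, not their actual code):

    EXPECTED_FIELDS = 21

    def load_template_instance(fields: list[str]) -> dict:
        # Check the shape of what we were actually handed before trusting it,
        # instead of assuming the producer followed the agreed schema.
        if len(fields) != EXPECTED_FIELDS:
            # Reject (and report) the bad record rather than reading past the
            # end of it and taking the whole process down with us.
            raise ValueError(
                f"expected {EXPECTED_FIELDS} fields, got {len(fields)}; refusing to load"
            )
        return {"match_criteria": fields}

Even that one check turns "boot loop" into "rule didn't load, log an error".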
As we all know there are greater issues with their deployment pipelines (lack of canaries, phased rollouts etc.) but no point going over those in this context.
The "fail safe" for a security component is in fact to prevent any user space code from running at all - better that than having it actively harm other systems, exfiltrate data, destroy connected hardware etc. So, no amount of clever design can prevent the CrowdStrike sensor from nuking your system if bad security rules get deployed.
For example, if a bad definition file makes it think that the legit libc or win32 libraries are compromised, it should prevent any userspace program from running, which is just as destructive as failing during boot.
That is why appropriate QA is critical for this type of program. I would expect any definition update of any kind to be tested on dozens of systems with a wide variety of Windows configurations and known-good software well before ever being deployed to any customer system. It seems that CrowdStrike took the exact opposite approach, and in fact their customers were the first to ever run their new code end-to-end, not the last...
This is too binary a way to think about a complex system. Availability is also a security goal so we shouldn’t cavalierly trade it for minor risks which are mostly edge cases.
For example, say the fail-safe was the old, old idea of keeping the second most recent version: if the system fails to start or crashes repeatedly, it automatically rolls back to the last known good version. That turns this kind of problem into at most a reboot (a huge win every customer would have taken), and the only case where it introduces a vulnerability is an active attack which only the latest rules will block and which is so virulent that the number of systems it compromises approximates the number that would be affected by a bad update. That's an unlikely set of events, especially because there's a really tight window where such a fast-spreading attack wouldn't have already compromised the host before CrowdStrike could ship the update.
Another variation of that idea: any time the system fails to start repeatedly, the service blocks processes other than its updater so normal apps aren’t exposed as potential vectors but the system can self-heal in most cases.
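A minimal sketch of that last-known-good rollback, assuming a hypothetical user-mode watchdog that keeps the previous content set on disk (all paths and names below are made up; a real kernel component would need OS support such as boot counters):

    import json, shutil
    from pathlib import Path

    STATE = Path("C:/ProgramData/ExampleSensor/boot_state.json")
    CURRENT = Path("C:/ProgramData/ExampleSensor/content_current")
    PREVIOUS = Path("C:/ProgramData/ExampleSensor/content_previous")
    MAX_FAILED_STARTS = 2

    def record_start_attempt() -> int:
        # Bump a counter every time we try to bring the sensor up.
        state = json.loads(STATE.read_text()) if STATE.exists() else {"failed": 0}
        state["failed"] += 1
        STATE.write_text(json.dumps(state))
        return state["failed"]

    def mark_start_successful() -> None:
        # Called once the sensor is up and healthy; resets the counter.
        STATE.write_text(json.dumps({"failed": 0}))

    def maybe_roll_back() -> None:
        # If we keep crashing before ever marking a clean start, restore the
        # previous (known good) content set and try again.
        if record_start_attempt() > MAX_FAILED_STARTS and PREVIOUS.exists():
            shutil.rmtree(CURRENT, ignore_errors=True)
            shutil.copytree(PREVIOUS, CURRENT)
            STATE.write_text(json.dumps({"failed": 0}))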
"This parameter count mismatch evaded multiple layers of build validation and testing, as it was not discovered during the sensor release testing process, the Template Type (using a test Template Instance) stress testing or the first several successful deployments of IPC Template Instances in the field."
Curious that csagent.sys isn't mentioned until the last page, p. 12:
"csagent.sys is CrowdStrike’s file system filter driver, a type of kernel driver that registers with components of the Windows operating system…"
> Some people, when confronted with a problem, think
> “I know, I’ll use regular expressions.”
> Now they have two problems.
The thing I’ve been thinking about is all of the assurances they made about SDLC, testing, secure development practices, etc. They have so many huge customers in regulated industries, government, etc. that they completed almost every certification in existence, and seeing this really raises questions about how those assertions were reviewed.
--
Team 1 tells Team 2 that the schema is updating.
Team 2 updates their schema.
Team 2 tests against updated schema (which would be a test file).
All green in test.
Team 1 doesn't actually follow the schema.
Deployment fails.
The new schema was improperly tested (among a list of other failures).
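And a sketch of the test that seems to have been missing: run an artifact generated by the producer's real code path through the consumer's real parser, instead of validating a hand-written fixture. (Everything below is a stand-in for illustration, not their actual code.)

    EXPECTED_FIELDS = 21

    def produce_channel_record() -> str:
        # Stand-in for Team 1's generator; imagine it only emits 20 fields.
        return "|".join(f"field{i}" for i in range(20))

    def parse_channel_record(record: str) -> list[str]:
        # Stand-in for Team 2's parser.
        fields = record.split("|")
        if len(fields) != EXPECTED_FIELDS:
            raise ValueError(f"expected {EXPECTED_FIELDS} fields, got {len(fields)}")
        return fields

    def test_producer_output_parses():
        # Fails the moment producer and consumer disagree about the schema --
        # exactly the mismatch that a hand-crafted test file can hide.
        parse_channel_record(produce_channel_record())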
> The selection of data in the channel file was done manually and included a regex wildcard matching criterion in the 21st field for all Template Instances, meaning that execution of these tests during development and release builds did not expose the latent out-of-bounds read in the Content Interpreter when provided with 20 rather than 21 inputs.
(.+)|(.+)|(.+)|(.+)|(.+)|(.+)|(.+)|(.+)|(.+)|(.+)|(.+)|(.+)|(.+)|(.+)|(.+)|(.+)|(.+)|(.+)|(.+)|(.+)|(.+)
or even this (.{4})(.{7})(.{3})(.{6})(.{9})(.{4})(.{7})(.{3})(.{6})(.{9})(.{4})(.{7})(.{3})(.{6})(.{9})(.{4})(.{7})(.{3})(.{6})(.{9})(.{1})
would simply fail to match. And I wouldn't necessarily blame the developer in either scenario - they received a card that says "hey the channel file will now have an extra field in its schema"... no one said "btw it's optional".
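To spell that out (purely illustrative -- pretend the channel entries were pipe-delimited text, which they aren't; the real format is binary), a strict 21-field pattern fails against a 20-field line while a wildcarded one happily matches:

    import re

    # 20 field values, i.e. the new 21st field never actually shows up.
    line_with_20_fields = "|".join(f"value{i}" for i in range(20))

    # Strict pattern demanding 21 fields, like the ones above (literal '|'
    # escaped so it isn't treated as regex alternation).
    strict_21 = re.compile(r"\|".join([r"(.+)"] * 21))

    # Lenient pattern whose 21st criterion is effectively a wildcard, roughly
    # what the RCA describes: it never insists on a real 21st value.
    lenient_21 = re.compile(r"\|".join([r"(.+)"] * 20) + r"(\|.*)?")

    print(strict_21.fullmatch(line_with_20_fields))   # None: a strict test flags the gap
    print(lenient_21.fullmatch(line_with_20_fields))  # matches: the wildcard hides it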
Calling it a "first year programming mistake" like I'm reading in some media is somewhat incendiary. I see unmarshalling errors happen all the time.
The forest we must not miss for the trees is that the kernel-level driver simply dies with no error recovery and bricks the system.
The bug in the clients (sensors) wasn't due to regex; the regex was in their integration/unit testing, which also had a bug and never supplied the 21st parameter to the client code.