I feel they focus a lot on their content validator lacking a check to catch this specific error (probably since that sounds like a more understandable oversight) when the more glaring issue is that they didn't try actually running this template instance on even a single machine, which would've instantly revealed the issue.
Even for amateur software with no unit/integration tests, the developer will typically still have run it on their own machine to see it working. Here CrowdStrike seem to have been flying blind, just praying new template instances work if they pass the validation checks.
They do at least promise to "ensure that every new Template Instance is tested" further down.
This is covered in part by a staged deployment... but that's just having your users test for you. Where's the automated integration test, or just the boot test?
Everyone else sees these services as the patsy when the problem happens.
From a technical perspective it's a hot mess (you are spot on). But business says "everything is fine, this is fine, carry on", because it meets their goal of CYA.
At no point did they deploy this file to a computer they owned and attempt to boot it. They purposely decided to deploy behavior to every computer they could without even once making sure it wouldn't break from something stupid.
Are these people fucking nuts?
I do more testing than this and I might be incompetent. Also nothing I touch will kill millions of PCs. I get having pressure put on you from above, I get being encouraged to cut corners so some shithead can check off a box on his yearly review and make more money while stiffing you on your raise, I get making mistakes.
But like, fuck man, come on.
I've made changes on personal projects that I thought were simple, and yet broke stuff. But CrowdStrike is a multi-billion dollar company -- how can it be possible to have such a broken process? Their RCA document was interesting, but didn't cover any of the interesting issues. It seems that they don't know about the 5 Whys process (https://en.wikipedia.org/wiki/Five_whys) or decided that those answers were so embarrassing that they had to omit them.
It's not uncommon for devs to be working against outdated databases / config dumps. Certainly bad practice, but when devs have the option of being lazy vs doing chores, they will pick the path of least resistance.
> But what I can't understand is that the human who initiated this change decided not to see if it actually did what they wanted it to do.
We're assuming that the person who changed the code also made the choice to initiate the rollout. They are 2 separate actions which can be made by separate individuals, and could also involve many steps in between, each undertaken by a different individual as well.
Distance from Prod does introduce a sense of malaise and complacency, I've found.
Team 1 tells Team 2 that the schema is updating.
Team 2 updates their schema.
Team 2 tests against updated schema.
All green in test.
Team 1 doesn't actually follow the schema.
Deployment fails.
---
It's really hard to assign blame, but I'd put more blame on Team 2 for not being defensive enough with their inputs.
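To make that concrete, here's a rough sketch of what "defensive with inputs" could look like on the consumer side (the names and the 21-field count here are purely illustrative, not their actual code):

    EXPECTED_FIELDS = 21

    def load_template_instance(fields: list[str]) -> dict:
        # Check the shape of what we were actually handed before trusting it,
        # instead of assuming the producer followed the agreed schema.
        if len(fields) != EXPECTED_FIELDS:
            # Reject (and report) the bad record rather than reading past the
            # end of it and taking the whole process down with us.
            raise ValueError(
                f"expected {EXPECTED_FIELDS} fields, got {len(fields)}; refusing to load"
            )
        return {"match_criteria": fields}

Even that one check turns "boot loop" into "rule didn't load, log an error".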
As we all know there are greater issues with their deployment pipelines (lack of canaries, phased rollouts etc.) but no point going over those in this context.
The "fail safe" for a security component is in fact to prevent any user space code from running at all - better that than having it actively harm other systems, exfiltrate data, destroy connected hardware etc. So, no amount of clever design can prevent the CrowdStrike sensor from nuking your system if bad security rules get deployed.
For example, if a bad definition file makes it think that the legit libc or win32 libraries are compromised, it should prevent any userspace program from running, which is just as destructive as failing during boot.
That is why appropriate QA is critical for this type of program. I would expect any definition update of any kind to be tested on dozens of systems with a wide variety of Windows configurations and known-good software well before ever being deployed to any customer system. It seems that CrowdStrike took the exact opposite approach, and in fact their customers were the first to ever run their new code end-to-end, not the last...
This is too binary a way to think about a complex system. Availability is also a security goal so we shouldn’t cavalierly trade it for minor risks which are mostly edge cases.
For example, say the fail-safe was the old, old idea of keeping the second most recent version: if the system fails to start or crashes repeatedly, it automatically rolls back to the last known good version. That turns this kind of problem into at most a reboot (a huge win every customer would have taken), and the only case where it introduces a vulnerability is an active attack which only the latest rules will block and which is so virulent that the number of systems it compromises approximates the number that would be affected by a bad update. That's an unlikely set of events, especially because there's a really tight window where such a fast-spreading attack wouldn't have already compromised the host before CrowdStrike could ship the update.
Another variation of that idea: any time the system fails to start repeatedly, the service blocks processes other than its updater so normal apps aren’t exposed as potential vectors but the system can self-heal in most cases.
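A minimal sketch of that last-known-good rollback, assuming a hypothetical user-mode watchdog that keeps the previous content set on disk (all paths and names below are made up; a real kernel component would need OS support such as boot counters):

    import json, shutil
    from pathlib import Path

    STATE = Path("C:/ProgramData/ExampleSensor/boot_state.json")
    CURRENT = Path("C:/ProgramData/ExampleSensor/content_current")
    PREVIOUS = Path("C:/ProgramData/ExampleSensor/content_previous")
    MAX_FAILED_STARTS = 2

    def record_start_attempt() -> int:
        # Bump a counter every time we try to bring the sensor up.
        state = json.loads(STATE.read_text()) if STATE.exists() else {"failed": 0}
        state["failed"] += 1
        STATE.write_text(json.dumps(state))
        return state["failed"]

    def mark_start_successful() -> None:
        # Called once the sensor is up and healthy; resets the counter.
        STATE.write_text(json.dumps({"failed": 0}))

    def maybe_roll_back() -> None:
        # If we keep crashing before ever marking a clean start, restore the
        # previous (known good) content set and try again.
        if record_start_attempt() > MAX_FAILED_STARTS and PREVIOUS.exists():
            shutil.rmtree(CURRENT, ignore_errors=True)
            shutil.copytree(PREVIOUS, CURRENT)
            STATE.write_text(json.dumps({"failed": 0}))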
"This parameter count mismatch evaded multiple layers of build validation and testing, as it was not discovered during the sensor release testing process, the Template Type (using a test Template Instance) stress testing or the first several successful deployments of IPC Template Instances in the field."
Curious that csagent.sys isn't mentioned until the last page, p. 12:
"csagent.sys is CrowdStrike’s file system filter driver, a type of kernel driver that registers with components of the Windows operating system…"
> Some people, when confronted with a problem, think
> “I know, I’ll use regular expressions.”
> Now they have two problems.
The thing I’ve been thinking about is all of the assurances they made about SDLC, testing, secure development practices, etc. They have so many huge customers in regulated industries, government, etc. that they completed almost every certification in existence, and seeing this really raises questions about how those assertions were reviewed.
--
Team 1 tells Team 2 that the schema is updating.
Team 2 updates their schema.
Team 2 tests against updated schema (which would be a test file).
All green in test.
Team 1 doesn't actually follow the schema.
Deployment fails.
The new schema was improperly tested (among a list of other failures).
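And a sketch of the test that seems to have been missing: run an artifact generated by the producer's real code path through the consumer's real parser, instead of validating a hand-written fixture. (Everything below is a stand-in for illustration, not their actual code.)

    EXPECTED_FIELDS = 21

    def produce_channel_record() -> str:
        # Stand-in for Team 1's generator; imagine it only emits 20 fields.
        return "|".join(f"field{i}" for i in range(20))

    def parse_channel_record(record: str) -> list[str]:
        # Stand-in for Team 2's parser.
        fields = record.split("|")
        if len(fields) != EXPECTED_FIELDS:
            raise ValueError(f"expected {EXPECTED_FIELDS} fields, got {len(fields)}")
        return fields

    def test_producer_output_parses():
        # Fails the moment producer and consumer disagree about the schema --
        # exactly the mismatch that a hand-crafted test file can hide.
        parse_channel_record(produce_channel_record())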
> The selection of data in the channel file was done manually and included a regex wildcard matching criterion in the 21st field for all Template Instances, meaning that execution of these tests during development and release builds did not expose the latent out-of-bounds read in the Content Interpreter when provided with 20 rather than 21 inputs.
(.+)|(.+)|(.+)|(.+)|(.+)|(.+)|(.+)|(.+)|(.+)|(.+)|(.+)|(.+)|(.+)|(.+)|(.+)|(.+)|(.+)|(.+)|(.+)|(.+)|(.+)
or even this (.{4})(.{7})(.{3})(.{6})(.{9})(.{4})(.{7})(.{3})(.{6})(.{9})(.{4})(.{7})(.{3})(.{6})(.{9})(.{4})(.{7})(.{3})(.{6})(.{9})(.{1})
would simply fail to match. And I wouldn't necessarily blame the developer in either scenario - they received a card that says "hey the channel file will now have an extra field in its schema"... no one said "btw it's optional".
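To spell that out (purely illustrative -- pretend the channel entries were pipe-delimited text, which they aren't; the real format is binary), a strict 21-field pattern fails against a 20-field line while a wildcarded one happily matches:

    import re

    # 20 field values, i.e. the new 21st field never actually shows up.
    line_with_20_fields = "|".join(f"value{i}" for i in range(20))

    # Strict pattern demanding 21 fields, like the ones above (literal '|'
    # escaped so it isn't treated as regex alternation).
    strict_21 = re.compile(r"\|".join([r"(.+)"] * 21))

    # Lenient pattern whose 21st criterion is effectively a wildcard, roughly
    # what the RCA describes: it never insists on a real 21st value.
    lenient_21 = re.compile(r"\|".join([r"(.+)"] * 20) + r"(\|.*)?")

    print(strict_21.fullmatch(line_with_20_fields))   # None: a strict test flags the gap
    print(lenient_21.fullmatch(line_with_20_fields))  # matches: the wildcard hides it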
Calling it a "first year programming mistake" like I'm reading in some media is somewhat incendiary. I see unmarshalling errors happen all the time.
The forest we must not miss for the trees is that the kernel-level driver simply dies with no error recovery and bricks the system.
The bug in the clients (sensors) wasn't due to regex; the regex was in their integration/unit testing, which also had a bug and never supplied the 21st parameter to the client code.