2024-07-28

Microsoft technical breakdown of CrowdStrike incident

CrowdStrike failure and QA practices

Many see CrowdStrike’s rollout as grossly incompetent or negligent: no canary/staged deployment, inadequate testing of kernel code and of “content” (config) updates.
Bug path as described: kernel functionality for monitoring named pipes had been added earlier, and the driver shipped and was considered “stable”; a malformed or unexpected content update file later triggered a null dereference in the kernel, causing BSODs.
Some argue this class of product must be designed as if safety‑critical: strong fuzzing, robust error handling (e.g., bad content disables rule, not the OS), telemetry on crashes.
Others note similar “instant push” practices exist in AV/EDR for indicators/signatures, because delays can matter during active attacks; they see CrowdStrike’s behavior as common, not obviously out of industry norms.
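The "bad content disables the rule, not the OS" idea raised above can be illustrated with a minimal sketch. The record format (three pipe-separated fields), the `Rule` type, and `load_rules` are hypothetical and do not reflect CrowdStrike's actual content format; the point is only that each record is validated before any field is accessed, and a malformed record is set aside rather than crashing the loader.

```python
from dataclasses import dataclass

# Hypothetical record layout: name|pattern|action
EXPECTED_FIELDS = 3


@dataclass
class Rule:
    name: str
    pattern: str
    action: str


def load_rules(lines):
    """Parse rule records, disabling malformed ones instead of failing.

    Returns (active, disabled): parsed rules, and (line number, raw text)
    pairs for every record that failed validation.
    """
    active, disabled = [], []
    for lineno, line in enumerate(lines, start=1):
        fields = line.strip().split("|")
        # Check the field count and reject empty fields *before* using
        # them, so a short or blank record never reaches the rule engine.
        if len(fields) != EXPECTED_FIELDS or not all(fields):
            disabled.append((lineno, line))
            continue
        active.append(Rule(*fields))
    return active, disabled


if __name__ == "__main__":
    content = ["pipe_watch|evil_pipe*|alert", "truncated|record", ""]
    rules, skipped = load_rules(content)
    print(f"{len(rules)} active, {len(skipped)} disabled")
```

In a real sensor the disabled records would also be reported through telemetry, so the vendor learns about a bad content file from the first few hosts rather than from crash reports.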

Kernel vs user‑mode security design

Broad agreement that doing so much work in kernel space is dangerous; kernel‑mode code should be limited to minimal sensors and enforcement, with parsing and logic in user space.
Some point to past research (CFI/XFI) and eBPF as ways to safely constrain code, and note that Linux/macOS push more EDR logic out of the kernel.
Counterpoint: real‑time filesystem and process interception with acceptable performance historically required kernel drivers on Windows; user‑mode APIs are still incomplete.

Microsoft’s role and EU/competition angle

One camp blames EU antitrust decisions: Microsoft tried to restrict kernel tampering (e.g., PatchGuard) and was forced to give third‑party security tools the same kernel‑level access as Defender.
Others respond that the EU only required equal access, not unlimited; Microsoft could have built safer out‑of‑kernel APIs and moved Defender there too.
Debate over how much fault lies with Microsoft for:
- Allowing third‑party kernel drivers at all.
- Not offering robust user‑mode security APIs.
- Not having stronger kernel safeguards (e.g., automatic rollback after repeated BSODs, better isolation of ELAM drivers).

Recovery and OS behavior

Several commenters argue Windows could have greatly mitigated impact by:
- Detecting repeated crashes from the same driver and offering to disable it, or
- Booting into a networked recovery mode with the offending driver disabled.
Others worry such behavior could be abused by malware to deliberately crash security drivers three times and escape protection.
Clarification: CrowdStrike’s driver is an ELAM/boot‑critical driver; Windows already treats those as non‑optional, limiting rollback behavior.
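The crash-counting proposal above can be sketched as follows. This is not how Windows behaves; `faulting_streak`, `recovery_action`, the threshold of three, and the action strings are all hypothetical. The sketch treats boot-critical drivers differently, prompting an operator in a recovery mode instead of disabling them automatically, as one possible answer to the concern that malware could crash a security driver on purpose.

```python
def faulting_streak(boot_history):
    """Return (driver, n): the driver blamed for the last n consecutive boots.

    boot_history lists one entry per boot attempt, oldest first: the name
    of the driver blamed for the crash, or None for a clean boot.
    """
    if not boot_history or boot_history[-1] is None:
        return None, 0
    driver = boot_history[-1]
    n = 0
    for entry in reversed(boot_history):
        if entry != driver:
            break
        n += 1
    return driver, n


def recovery_action(boot_history, boot_critical, threshold=3):
    """Decide what to do on the next boot (all policy here is hypothetical)."""
    driver, n = faulting_streak(boot_history)
    if n < threshold:
        return "boot_normally"
    if driver in boot_critical:
        # Never silently drop a boot-critical (e.g. security) driver:
        # require an explicit decision from an operator instead.
        return f"recovery_mode_prompt:{driver}"
    return f"disable:{driver}"


if __name__ == "__main__":
    history = ["sensor.sys", "sensor.sys", "sensor.sys"]
    print(recovery_action(history, boot_critical={"sensor.sys"}))
```

Requiring consecutive crashes, rather than a lifetime total, means a single clean boot resets the count, so a driver is not penalized for unrelated crashes spread over months.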

Comparisons to other platforms and mechanisms

macOS: widely cited for having pushed third‑party security tools out of kernel space; some argue Apple’s tight control and small desktop share made this politically easier.
Linux: CrowdStrike also ships a kernel module and eBPF sensor and has caused Linux outages too (including one tied to a Red Hat kernel bug).
eBPF for Windows exists but is described as limited and experimental; people see it as a promising long‑term alternative, not ready today.

Responsibility, negligence, and critical infrastructure

Many see CrowdStrike as principally at fault, with repeated severe incidents (Windows and Linux) described as “damning.”
Others argue responsibility is shared:
- Organizations choosing to run kernel‑level third‑party security on mission‑critical systems without robust fallbacks or test rings.
- Microsoft for designing and selling an OS where third‑party kernel code is normal and where catastrophic third‑party failures are hard to recover from.
Some question whether deaths actually occurred; others say even if unproven, outages of hospitals, airlines, and emergency services show unacceptable systemic risk.

Update practices and comparisons to Microsoft

Multiple comments highlight that even small fleets use staged rollouts; pushing an untested, globally deployed update to 8.5M endpoints is called “bonkers.”
Others counter that Microsoft and other vendors have also shipped flawed updates (Windows patches, Defender signatures, 365/Azure configs), and that staged rollouts only reduce the blast radius; they do not eliminate bad updates.
Still, many emphasize this was a content update, not code, and argue canaries and fuzzing should still have caught it.
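The staged (canary) rollout that commenters expected can be sketched as below. The ring sizes, the failure threshold, and the `deploy` and `healthy` callbacks are hypothetical placeholders, not any vendor's actual pipeline. The sketch also reflects the counterpoint in the discussion: a bad update still breaks the first ring, and staging only limits how far it spreads.

```python
def staged_rollout(hosts, stages, deploy, healthy, max_failure_rate=0.01):
    """Deploy to growing cumulative fractions of the fleet.

    stages: cumulative fractions, e.g. (0.01, 0.1, 1.0).
    deploy(host): pushes the update to one host.
    healthy(host): returns True if the host is still reporting in.
    Halts as soon as one ring's failure rate exceeds max_failure_rate.
    """
    done = 0
    total_failed = 0
    for fraction in stages:
        target = min(len(hosts), max(done, round(len(hosts) * fraction)))
        ring = hosts[done:target]
        for host in ring:
            deploy(host)
        failures = sum(1 for host in ring if not healthy(host))
        total_failed += failures
        done = target
        if ring and failures / len(ring) > max_failure_rate:
            return {"halted": True, "deployed": done, "failed": total_failed}
    return {"halted": False, "deployed": done, "failed": total_failed}


if __name__ == "__main__":
    fleet = list(range(1000))
    # Simulate an update that breaks every host it reaches.
    result = staged_rollout(fleet, (0.01, 0.1, 1.0),
                            deploy=lambda h: None,
                            healthy=lambda h: False)
    print(result)
```

With these hypothetical ring sizes, an update that breaks every host stops after the first 1% ring instead of reaching the whole fleet.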

Security ecosystem, surveillance, and market structure

Some view EDR/AV vendors as adding attack surface and instability more than security, especially when OS‑level defenses (e.g., Defender, macOS built‑ins) are already strong.
CrowdStrike is described by a few as “corporate spyware,” though others argue any large enterprise will monitor endpoints and that CrowdStrike is not the primary employee‑surveillance tool.
Discussion on vendor lock‑in: many organizations “choose” Windows because key software only runs there; security/availability are rarely decisive market factors.
A minority warn against using the incident to justify “digital totalitarianism” where only the OS vendor can ship powerful software.

Standards, liability, and future directions

Calls for:
- Stronger liability for digital infrastructure, analogous to safety standards for physical goods.
- Possibly Microsoft‑run fuzzing / “Project Zero‑style” scrutiny for widely deployed drivers and apps.
- More Rust and memory‑safe code in the Windows kernel, better user‑mode security APIs, and expanded eBPF support.
Disagreement remains over whether Microsoft has learned enough from its own long history of botched updates to credibly “teach” others, or whether all major vendors remain too error‑prone.
