2024-07-19

CrowdStrike fixes start at "reboot up to 15 times", gets more complex from there

Power and responsibility of global updates

Many commenters recoil at the idea of being “the person who presses the button,” describing intense stress when large rollouts go wrong.
Others joke about the godlike power of bricking the world, but note that anyone who has actually held that power would never want it.
Strong sentiment that individual operators shouldn’t be scapegoated; this is seen as a systemic/process failure.

How the faulty update and “15 reboots” work

CrowdStrike’s driver loads very early in boot, phones home, and pulls frequently updated “channel/data” files.
The bug is triggered by a mangled data/config file that crashes the driver and causes BSODs.
Rebooting repeatedly is seen as a probabilistic race: on any given boot, the agent might fetch the fixed data before the driver hits the bad file. Many view this “solution” as pathetic and fragile.
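To make the "probabilistic race" framing concrete, here is a minimal sketch of the arithmetic. It assumes each boot is an independent attempt with the same chance of fetching the fix before the crash; neither that independence nor any per-boot probability comes from the thread or from CrowdStrike, so the inputs below are purely illustrative.

```python
def chance_of_recovery(p_per_boot: float, reboots: int) -> float:
    """Probability that at least one of `reboots` boots wins the race.

    Assumes (hypothetically) that boots are independent and that each one
    has the same probability `p_per_boot` of fetching the fixed data
    before the faulty file is loaded.
    """
    if not 0.0 <= p_per_boot <= 1.0:
        raise ValueError("p_per_boot must be between 0 and 1")
    if reboots < 0:
        raise ValueError("reboots must be non-negative")
    # Complement of "every single boot loses the race".
    return 1.0 - (1.0 - p_per_boot) ** reboots


if __name__ == "__main__":
    # Illustrative per-boot odds only; the real value is unknown.
    for p in (0.05, 0.10, 0.20):
        print(f"p={p:.2f}: {chance_of_recovery(p, 15):.1%} after 15 reboots")
```

Under these assumptions, even a modest per-boot chance adds up over 15 attempts, but the result never reaches certainty, which matches the commenters' view that the workaround is fragile.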

Kernel‑mode security software risks

Core criticism: AV/EDR with kernel privileges auto-loading unvalidated data is an enormous attack and failure surface.
Complaints about: no robust input validation, lack of graceful failure, use of memory-unsafe languages in the kernel, and ability for a corrupt file to brick the OS.
Some argue AV must run at this level to defeat rootkits; others say it’s “lazy” design and more could be done in user space or via microkernel-style patterns.
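The input validation and graceful failure that commenters ask for could look roughly like the sketch below. The file layout (a `CHNL` magic value, a length field, and a CRC32) and all function names are invented for illustration; this is not CrowdStrike's actual format or code, and a real kernel driver would not be written in Python.

```python
import struct
import zlib

MAGIC = b"CHNL"  # hypothetical 4-byte signature, not a real format
HEADER = struct.Struct("<4sII")  # magic, payload length, CRC32 of payload


def pack_channel_file(payload: bytes) -> bytes:
    """Build a file in the hypothetical format (used here to make test data)."""
    return HEADER.pack(MAGIC, len(payload), zlib.crc32(payload)) + payload


def validate_channel_file(blob: bytes) -> bytes:
    """Return the payload if `blob` is well-formed, else raise ValueError."""
    if len(blob) < HEADER.size:
        raise ValueError("file shorter than header")
    if not any(blob):
        raise ValueError("file is entirely zero bytes")
    magic, length, crc = HEADER.unpack_from(blob)
    if magic != MAGIC:
        raise ValueError("bad magic value")
    payload = blob[HEADER.size:]
    if len(payload) != length:
        raise ValueError("declared length does not match payload")
    if zlib.crc32(payload) != crc:
        raise ValueError("checksum mismatch")
    return payload


def load_with_fallback(new_blob: bytes, last_known_good: bytes) -> bytes:
    """Graceful failure: keep the previous data if the new file is rejected."""
    try:
        return validate_channel_file(new_blob)
    except ValueError:
        return validate_channel_file(last_known_good)
```

The design point is that a rejected update leaves the system running on its previous data instead of taking the machine down with it.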

Auto‑updates, QA, and rollout practices

Many say automatic, global, immediate updates for kernel-level components (even “just” data/config) are unacceptable for critical systems.
Calls for staged/canary rollouts, stronger CI/fuzzing of parsers, and clearer separation of what can auto-update.
Others counter that virus definitions need rapid deployment, making staged rollouts tricky, but agree this design left no safety net.
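A staged rollout of the kind being called for can be sketched in a few lines. Everything here is an assumption made for illustration: the stage percentages, the crash-rate threshold, and the function names are hypothetical and do not describe any vendor's pipeline.

```python
import hashlib


def rollout_bucket(host_id: str, update_id: str) -> int:
    """Map a host to a stable bucket in 0..99 for a given update."""
    digest = hashlib.sha256(f"{update_id}:{host_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def should_install(host_id: str, update_id: str, percent_enabled: int) -> bool:
    """A host installs the update once its bucket falls under the enabled share."""
    return rollout_bucket(host_id, update_id) < percent_enabled


def next_stage(current_percent: int, crash_rate: float,
               max_crash_rate: float = 0.001,
               stages: tuple = (1, 5, 25, 100)) -> int:
    """Return the next rollout percentage, or 0 to halt the rollout.

    `crash_rate` is the fraction of already-updated hosts reporting a crash;
    the threshold and stage sizes are illustrative defaults.
    """
    if crash_rate > max_crash_rate:
        return 0  # stop: the canary population is unhealthy
    for stage in stages:
        if stage > current_percent:
            return stage
    return 100
```

Because the bucket is derived from a hash, hosts enabled at 1% stay enabled at 5% and beyond. The stages can also be made short (minutes, not days) to address the objection that virus definitions need rapid deployment.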

Compliance, insurance, and “checkbox security”

Strong theme: CrowdStrike is seen as a compliance checkbox driven by regulators and cyber-insurance, not actual security engineering.
Pattern described: stricter liability → cyber insurance → mandated EDR → near-universal adoption of the same fragile tool → systemic risk.
“Security & Compliance” teams are accused of bypassing good engineering practices because their tools are deemed “so important.”

OS choice, monoculture, and blame

Debate over blaming Windows vs. CrowdStrike vs. the monoculture:
- Some say Windows’s model (third-party kernel modules, widespread use) makes this inevitable.
- Others note CrowdStrike has also broken Linux, and any kernel-space blob is inherently dangerous.
Several argue critical infrastructure shouldn’t depend on a single OS or a single vendor’s EDR agent.

Operational impact and real-world stories

First‑hand reports from shops and plants: CNC machines and lathes down, AC and alarms misbehaving, phones and email offline, payroll at risk.
Many industrial systems are described as expensive machines “strapped to a Windows PC,” often mandated to be networked for remote support or monitoring, then wrapped with corporate EDR for compliance.
Commenters question why such equipment is internet-connected and running broad endpoint tools, but others point to real business needs (remote diagnostics, SCADA overviews, utilization analytics).

Root cause theories and technical concerns

Some claim the bad file was effectively zeroed out, implying almost no validation before kernel parsing.
Concern that if malformed data can crash the kernel, it might also be exploitable for remote code execution if crafted.
Multiple commenters call this a “global multi-layer failure”: OS design, vendor design, lack of staged rollouts, poor DR planning, and the ubiquity of a single security product.

Proposed reforms and lessons

Suggestions include:
- Forcing detailed public technical postmortems and possibly congressional hearings.
- Treating auto-updating kernel/EDR components as a national security issue, potentially regulated.
- Requiring graceful failure modes and stronger isolation instead of relying on “heroes” or blind trust in vendors.
- Greater use of open source and owner control to reduce black-box, above-root agents.
