CrowdStrike ex-employees: 'Quality control was not part of our process'

Overall Theme: Speed vs. Quality in a Critical Security Product

  • Many commenters see the outage as strong evidence that velocity was prioritized over quality, especially for “Rapid Response” content.
  • The idea that “quality control wasn’t part of the process” matches multiple readers’ experience of modern tech culture: move fast, cut QA/SDET, let developers absorb testing.
  • Others caution that a single catastrophic event doesn’t prove chronic underinvestment without more data, but agree basic safeguards were clearly missing.

Debate over Ex-Employee Testimony

  • Some dismiss the article’s reliance on former employees, arguing they may be disgruntled, biased, or far from kernel work (e.g., UX).
  • Others counter that:
    • The RCA already confirms serious process failures.
    • Multiple ex-employees across roles reporting consistent issues is meaningful signal.
    • Corporate PR has its own, stronger bias.
  • Several note explicit examples from the article where ex-employee claims about product behavior are weakly or inconsistently rebutted by the company.

Technical and Process Failures

  • Key points drawn from the RCA and discussion:
    • Rapid Response content bypassed the staged rollout and dogfooding used for full sensor releases.
    • A bug in the content validator let malformed content through to a kernel driver that did not safely handle invalid input, causing the crash.
    • Configuration parsing in a kernel module, lack of bounds checks, and insufficient test coverage are seen as fundamental engineering failures.
    • Commenters stress that even “data” updates can be as dangerous as code and must be treated as untrusted input.

Previous Linux Incident and Failure to Generalize

  • A prior Linux bricking incident is discussed: some blame an upstream kernel regression; others argue the lesson should still have been “never push globally without strong testing and rollback.”
  • Point made that you don’t just fix the specific failure, you harden against the entire class of risks.
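The "never push globally" lesson reduces to a simple control loop: deploy in expanding rings, watch a health signal, and roll back everything touched the moment the signal degrades. The sketch below is an assumption-laden illustration; the stage fractions, threshold, and the `deploy`/`rollback`/`healthy_fraction` callbacks are hypothetical names, not any vendor's API.

```python
# Hypothetical staged-rollout sketch; all names and numbers are illustrative.
STAGES = [0.01, 0.10, 0.50, 1.00]  # fraction of the fleet per ring
HEALTH_THRESHOLD = 0.999           # abort if fewer hosts stay healthy

def staged_rollout(hosts, deploy, rollback, healthy_fraction):
    """Deploy ring by ring; halt and roll back on a bad health signal."""
    deployed = []
    for stage in STAGES:
        target_count = max(1, int(len(hosts) * stage))
        for h in hosts[len(deployed):target_count]:
            deploy(h)
            deployed.append(h)
        if healthy_fraction(deployed) < HEALTH_THRESHOLD:
            for h in deployed:  # undo every host touched so far
                rollback(h)
            return False        # halted long before the full fleet was hit
    return True
```

Under this scheme, the malformed-content failure would have bricked roughly the first 1% ring and stopped, which is the "harden against the class of risks" point: the safeguard works regardless of which specific bug slips through.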

Industry Culture, Regulation, and Accountability

  • Many say this is typical of large software orgs: weak QA, hero culture, incentives to hide problems rather than prevent them.
  • Comparisons are drawn to aviation, building codes, and financial trading systems where regulation, independent postmortems, and professional licensing enforce quality.
  • Several advocate similar regulation for critical software and even licensure for software engineers working on safety/security-critical systems.

Security Tool Data Collection and Secrets

  • A side thread highlights that the macOS agent sends environment variables (including secrets) to a cloud SIEM:
    • Some say this is standard for EDR/SIEM and that the SIEM or customer should mask sensitive data.
    • Others argue plaintext secrets in centralized logs are a serious design and compliance problem, especially under regimes like PCI and GDPR.

Impact, Market, and Alternatives

  • Anecdotes describe significant real-world harm (e.g., delayed surgeries) beyond financial loss.
  • Commenters note the outage is effectively a massive self-inflicted denial of service.
  • Despite the incident, the company’s market position remains strong, attributed to compliance and insurer pressure plus the lack of clear drop-in alternatives.
  • Alternatives mentioned: Microsoft Defender/Defender for Endpoint and Sentinel, SentinelOne, Carbon Black, or in-house capability—though insurance and regulations often require third-party EDR.