2024-08-11

CrowdStrike accepting the Pwnie Award for "most epic fail" at DEF CON

Reactions to accepting the Pwnie Award

Many see showing up and accepting the “most epic fail” award as the least-bad PR option: refusing would look evasive, whereas attending allows public contrition and serves as a reminder to staff.
Others call it “tone deaf” and say it trivializes a catastrophe, laughing off a disaster that caused global disruption.
Some note the acceptance speech came across as sober and self‑critical, not jokey; critics respond that context (a fun con talk, applause, trophy) makes it inappropriate regardless of tone.

Human impact and seriousness of the outage

Commenters describe severe real‑world impact: grounded flights, hospital and ER disruptions, 911 outages, pharmacy issues, lost business and productivity.
Debate over deaths: some are “certain” people died indirectly (delayed care, emergency stress); others say no concrete evidence has surfaced and stress that hospitals have downtime procedures.
Several note that even “just” elevated stress and missed life events (funerals, last goodbyes, surgeries) are serious harms.

Liability, lawsuits, and contracts

Many ask why there are few visible lawsuits given claimed losses in the billions.
CrowdStrike contracts reportedly cap liability to low millions; some argue that won’t withstand “gross negligence” claims, especially from insurers.
Delta’s suit and CrowdStrike’s (CS) public response are discussed: CS points to contractual caps, hints at aggressive discovery into Delta’s IT practices, and suggests Delta’s prolonged outage was partly its own fault.
Some expect insurers and reinsurers to be the main drivers of any serious reckoning, e.g., by surcharging or refusing coverage when CS is in the stack.

Responsibility: CrowdStrike vs customers

Strong consensus that CS’s process was egregious: an update that crashes essentially 100% of target Windows systems implies fundamental testing and rollout failures.
Key detail: the “rapid response” update apparently bypassed customers’ usual staged rollout controls, leaving them unable to canary it.
Others argue enterprises also bear blame for:
- Allowing a third‑party kernel driver to be a single point of failure on critical systems.
- Not designing fallback procedures and “analog” continuity plans robust enough for such outages.
- Over‑relying on cloud and endpoint tools to satisfy auditors and insurers, not genuine risk analysis.
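
For readers unfamiliar with the "staged rollout" and "canary" terms used in this part of the thread, the idea can be sketched roughly as follows. This is a minimal illustration only, not CrowdStrike's or any customer's actual mechanism; the function name `staged_rollout`, the `apply_update` callback, the wave percentages, and the `fleet` host names are all hypothetical.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    hosts: Sequence[str],
    apply_update: Callable[[str], bool],
    stage_percents: Sequence[int] = (1, 10, 100),
    max_failures_per_wave: int = 0,
) -> List[str]:
    """Push an update in growing waves and stop at the first unhealthy wave.

    apply_update(host) returns True if the host is healthy afterwards.
    Returns the hosts that received the update before the rollout
    completed or was halted.
    """
    updated: List[str] = []
    for percent in stage_percents:
        # Cumulative number of hosts that should be updated after this wave
        # (at least one, so the first wave acts as the canary).
        target = max(1, len(hosts) * percent // 100)
        wave = hosts[len(updated):target]
        failures = 0
        for host in wave:
            healthy = apply_update(host)
            updated.append(host)
            if not healthy:
                failures += 1
        if failures > max_failures_per_wave:
            # Halt here: later waves never receive the bad update.
            break
    return updated


# Hypothetical fleet, and an update that crashes every host it touches.
fleet = [f"host-{i:03d}" for i in range(200)]
touched = staged_rollout(fleet, apply_update=lambda host: False)
print(len(touched))  # 2 of 200 hosts affected, rather than all 200
```

Under these assumptions, an update that crashes every machine is caught in the first 1% wave. The complaint in the thread is that the "rapid response" channel reportedly gave customers no equivalent gate.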

Software vs civil engineering and calls for accountability

Large sub‑thread compares software to civil engineering:
- One side: bridges have clear standards, licensing, and personal liability; software should evolve similar norms, especially for life‑critical systems.
- Opposing view: software changes too fast, is vastly more complex, and is attacked continuously; perfect safety is impossible and over‑regulation would cripple competitiveness.
Some advocate for a professionalized “real engineering” tier with licenses and sign‑off liability for safety‑critical code; others warn it would mostly create rent‑seeking gatekeepers and push innovation offshore.

Security tooling, SPOFs, and industry incentives

Many criticize the entire model of managed endpoint security:
- Closed‑source kernel code parsing untrusted input is seen as inherently dangerous.
- Centralized products that can remotely brick all endpoints are called “security single points of failure.”
Commenters note that many organizations deploy such tools mainly to tick compliance/insurance boxes; the risk of catastrophic vendor failure was underappreciated.
Some argue that if a system is truly life‑critical, running networked Windows with third‑party kernel agents is itself negligent, regardless of CS’s bug.

What consequences should follow

Views range from:
- “Nuke the company” / bankrupt and reconstitute it as a warning,
- To “fix the processes, don’t scapegoat individuals,” similar to how some large outages at other providers were handled.
Skeptics doubt meaningful change will occur without:
- Legal liability that survives EULAs and caps.
- Insurance pressure that makes unsafe stacks uninsurable.
- A cultural shift away from “move fast and break things” toward genuine engineering discipline.
