The Therac-25 Incident (2021)
Therac-25 as a systemic failure, not a lone “bug”
- Many comments stress that Therac-25 was not just “bad code” but a system failure: missing hardware interlocks, weak processes, slow incident escalation, poor field feedback, and bad safety assumptions.
- Older models had mechanical interlocks and, reportedly, even the same software fault, but the result was a blown fuse rather than an overdose; removing those interlocks without a new safety concept was seen as the key blunder (the shared fault is sketched just after this list).
- Several people argue there is almost never a single “root cause”; instead multiple defenses fail (“Swiss cheese model”).
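The shared fault mentioned above was, per Leveson and Turner's accident report, a race on shared state between the operator's data-entry task and the beam set-up task: if the operator edited the prescription within roughly eight seconds, while the magnets were still being set, the edit appeared on screen but was not picked up by the set-up code. Below is a loose, hypothetical Python model of that failure shape (a time-of-check to time-of-use race on an unsynchronized shared variable); the names and timing are illustrative, not the original assembly code.

```python
import threading
import time

# Unsynchronized shared state, standing in for the prescription the
# operator can still edit on screen (names here are illustrative).
prescription = {"mode": "xray", "energy_mev": 25}

def setup_task():
    """Beam set-up: snapshots the prescription, then spends time driving
    the magnets; an edit made during that window is silently missed."""
    snapshot = dict(prescription)      # time-of-check
    time.sleep(0.5)                    # magnets settling (the race window)
    fire_beam(snapshot)                # time-of-use: acts on stale data

def operator_edit():
    """Operator corrects the mode while the magnets are still moving."""
    time.sleep(0.1)
    prescription["mode"] = "electron"  # shown on screen, never re-read

def fire_beam(params):
    print(f"firing with {params}, but the screen now shows {prescription}")

t_setup = threading.Thread(target=setup_task)
t_edit = threading.Thread(target=operator_edit)
t_setup.start(); t_edit.start()
t_setup.join(); t_edit.join()
```

On the Therac-20 the same kind of stale-state bug reportedly existed, but a hardware interlock blew a fuse instead of letting the beam fire, which is exactly the point the commenters are making.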
Software vs hardware, and the role of independent failsafes
- Strong emphasis that safety engineering should assume software will fail and layer in independent hardware protections (interlocks, radiation sensors, physical limits); a minimal sketch of that veto pattern follows this list.
- Electromechanical failsafes are praised because their failure modes are independent of the software’s, and a tripped interlock is harder to ignore than a cryptic on‑screen error.
- Examples from industrial automation and aviation reinforce the idea: hard‑wired e‑stops, independent instruments, and formal failure analysis (e.g., required at Boeing).
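To make the “assume software fails, protect in hardware” point concrete, here is a hypothetical sketch in which the software’s decision to fire is only ever a request, and an independently wired interlock plus a separate dose-rate sensor can each veto it. All names, signals, and limits are assumptions for illustration, not any real device’s API; in a real system the vetoes would be wired, not coded.

```python
from dataclasses import dataclass

# Hard physical limit enforced outside the treatment software (illustrative value).
DOSE_RATE_LIMIT_GY_PER_S = 2.0

@dataclass
class HardwareInputs:
    """Signals read from independently wired hardware, not derived from
    the control software's own state (all names here are hypothetical)."""
    interlock_closed: bool     # e.g., door switch or turntable-position relay
    measured_dose_rate: float  # from a sensor that does not trust the software

def beam_permitted(software_says_fire: bool, hw: HardwareInputs) -> bool:
    """Defense in depth: the software's decision is necessary but never
    sufficient; any independent protection can veto the beam."""
    if not software_says_fire:
        return False
    if not hw.interlock_closed:                            # electromechanical veto
        return False
    if hw.measured_dose_rate > DOSE_RATE_LIMIT_GY_PER_S:   # independent sensor veto
        return False
    return True

# The software believes firing is safe, but the open interlock vetoes it.
print(beam_permitted(True, HardwareInputs(interlock_closed=False,
                                          measured_dose_rate=0.5)))  # -> False
```

The design point is that `beam_permitted` never consults the software’s own model of the machine state; each veto comes from a source whose failure is uncorrelated with a software defect.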
“Most deadly bug?” – other catastrophic software-related failures
- Candidates mentioned:
  - Boeing 737 MAX / MCAS (hundreds of deaths; debate over “bug” vs. bad design, reliance on a single angle-of-attack sensor, and avoidance of additional pilot training).
  - Air France 447 and how the flight controls handled conflicting pilot inputs.
  - The London Ambulance Service dispatch-system collapse in the early 1990s.
  - The UK Post Office Horizon scandal (false accounting, bankruptcies, suicides).
  - The 1991 Patriot missile clock-drift error at Dhahran (see the worked example after this list).
  - Alleged AI targeting systems in warfare.
- Several note gray areas: where bad policy, concealment, or economics matter more than pure coding faults.
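The Patriot failure lends itself to a worked example. Per the commonly cited GAO analysis, system uptime was counted in tenths of a second and converted to seconds using a 24-bit fixed-point approximation of 0.1; the truncation error accumulated over roughly 100 hours of continuous operation into a tracking error of several hundred meters. The sketch below reproduces that arithmetic; the constants are approximations for illustration, not the deployed code.

```python
# Fixed-point drift arithmetic behind the 1991 Patriot timing failure,
# following the commonly cited GAO analysis.

TRUE_TICK = 0.1                      # seconds per clock tick
# 0.1 truncated to a 24-bit binary fraction: floor(0.1 * 2**24) / 2**24
STORED_TICK = int(TRUE_TICK * 2**24) / 2**24
ERROR_PER_TICK = TRUE_TICK - STORED_TICK      # ~9.5e-8 s lost per tick

uptime_hours = 100                   # roughly the battery's continuous uptime
ticks = uptime_hours * 3600 * 10     # ticks elapsed over that period
clock_drift_s = ticks * ERROR_PER_TICK

SCUD_SPEED_M_PER_S = 1676            # approximate closing speed of a Scud
tracking_error_m = clock_drift_s * SCUD_SPEED_M_PER_S

print(f"drift after {uptime_hours} h: {clock_drift_s:.3f} s")
print(f"tracking error: {tracking_error_m:.0f} m")
# Roughly 0.34 s of drift and ~575 m of error -- enough for the range gate
# to look in the wrong place and dismiss the incoming missile.
```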
Process, culture, and developer quality
- One camp: quality is primarily the result of process, feedback loops, and organizational culture (reporting incidents, fixing them, documenting, independent QA, regulation).
- Another camp: good developers are a necessary precondition; no process can compensate for uniformly poor engineers.
- Many settle on a combined view: talent, process, and a culture of caring about quality are all required, especially for safety‑critical systems.
AI, “vibe‑coding,” and future Therac-style incidents
- Strong concern that LLM‑generated, untested code and “vibe‑coding” culture will recreate Therac‑style failures.
- An LLM‑induced outage cited in the thread is seen as a warning; commenters fear agentic systems being wired up to real hardware or medical devices.
Education, regulation, and ethics
- Many were taught Therac‑25 (and analogs like the Tacoma Narrows Bridge and Hyatt Regency walkway collapses) in CS/engineering ethics courses; others never saw it, or watched classmates treat it as a joke.
- Some point to modern standards (e.g., medical device software standards such as IEC 62304 and FDA scrutiny) as reasons a Therac‑25‑level incident is now less likely, while others doubt that process alone can prevent failures without ethical, empowered engineers.