10% of Firefox crashes are caused by bitflips

How Firefox Is Attributing Crashes to Bitflips

  • Firefox added a post-crash memory tester that runs on user machines; code is public (Rust runner + separate memtest crate).
  • Described techniques include:
    • Writing known bit patterns to RAM and reading back to detect flips.
    • Using “magic” sentinel values in data structures and checking whether they differ by only one or a few bits.
  • Reported measurement: ~5% of crashes are flagged as “potentially” due to bad or flaky memory; the author then extrapolates to ~10–15% using a “conservative heuristic” that is not fully explained.
  • Several commenters note that “potential” and the missing details make the true rate unclear.
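
The two detection techniques listed above can be sketched in a few lines. This is only an illustration of the logic, not Firefox's implementation: the sentinel value `MAGIC`, the `max_flips` threshold, the pattern set, and the `corrupt` test hook are all assumptions made for this example, and Python cannot pin or address physical RAM the way a real memory tester has to.

```python
def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


MAGIC = 0xDEADBEEF  # hypothetical sentinel stored inside a data structure


def classify_sentinel(observed: int, expected: int = MAGIC, max_flips: int = 2) -> str:
    """Classify a sentinel read-back by how many bits differ from the expected value."""
    distance = hamming_distance(observed, expected)
    if distance == 0:
        return "intact"
    if distance <= max_flips:
        # One or two wrong bits is what a hardware bitflip would look like.
        return "possible bitflip"
    # Many wrong bits looks more like an overwrite by buggy code.
    return "likely software corruption"


PATTERNS = (0x00, 0xFF, 0xAA, 0x55)  # all zeros, all ones, alternating bits


def pattern_test(buf: bytearray, corrupt=None) -> list:
    """Fill buf with each pattern, read it back, and return the mismatches.

    Each mismatch is a tuple (offset, expected_byte, observed_byte).
    `corrupt` is a hypothetical hook that simulates faulty memory in tests.
    """
    mismatches = []
    for pattern in PATTERNS:
        buf[:] = bytes([pattern]) * len(buf)
        if corrupt is not None:
            corrupt(buf)
        for offset, value in enumerate(buf):
            if value != pattern:
                mismatches.append((offset, pattern, value))
    return mismatches
```

For example, `classify_sentinel(MAGIC ^ (1 << 7))` returns "possible bitflip", while `classify_sentinel(0)` returns "likely software corruption" because 24 bits differ. The threshold is a judgment call, which is part of why commenters treat the resulting classification as "potential" rather than certain.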

Skepticism About the 10–15% Claim

  • Some find 10% of crashes from hardware defects “huge” and hard to believe, suspecting biased telemetry (e.g., a small number of very bad machines).
  • Others criticize the extrapolation from ~5% to ~10–15% as unsupported handwaving.
  • There are concerns that rare races, allocator or kernel bugs, or Firefox-specific issues could be misclassified as hardware faults.
  • Counter‑argument: large-scale crash triage in other systems (OSes, games, Go toolchain) also reveals a nontrivial tail of crashes best explained by memory or CPU faults.

User Reports and Comparative Behavior

  • Mixed experiences: some users see Firefox crash frequently (often on exit or under high tab count), others report near-zero crashes over years.
  • Multiple anecdotes of Firefox being the first app to fail on machines later diagnosed with bad RAM or misconfigured/overclocked memory.
  • Others claim Chromium-based browsers crash less on the same hardware, suggesting Firefox might simply be buggier or more memory-hungry.
  • It’s noted that crashes are concentrated on faulty machines, so “10% of crashes” does not mean 10% of users are impacted.
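
The last point is easy to check with arithmetic. The sketch below uses made-up numbers (1,000 users, 10 of them on faulty machines, 9 software crashes per user, 100 extra hardware-induced crashes per faulty machine) purely to show how the share of crashes and the share of users can diverge; none of these figures come from Firefox telemetry.

```python
def crash_shares(total_users: int,
                 faulty_users: int,
                 software_crashes_per_user: int,
                 hardware_crashes_per_faulty_user: int) -> tuple:
    """Return (share of crashes caused by hardware, share of users affected).

    All inputs are hypothetical; the function only does the bookkeeping.
    """
    software = total_users * software_crashes_per_user
    hardware = faulty_users * hardware_crashes_per_faulty_user
    crash_share = hardware / (software + hardware)
    user_share = faulty_users / total_users
    return crash_share, user_share


crash_share, user_share = crash_shares(1000, 10, 9, 100)
print(f"hardware share of crashes: {crash_share:.0%}")
print(f"share of users affected:   {user_share:.0%}")
```

With these assumed inputs, 1,000 of 10,000 crashes (10%) are hardware-induced even though only 1% of users own a faulty machine.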

Hardware, ECC, and Bitflip Context

  • Commenters emphasize that bitflips can arise from marginal RAM, heat, aging, PSU issues, or misconfiguration, not only cosmic rays.
  • ECC RAM and CPU cache ECC significantly reduce or surface errors but don’t eliminate them; many consumer systems lack full ECC support.
  • DDR5’s on-die “ECC” is distinguished from system-wide ECC; it is seen as improving yield and error rates, but not as equivalent to traditional ECC DIMMs.
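
To make the "reduce or surface errors" point concrete, here is a toy Hamming(7,4) code that corrects any single flipped bit in a 7-bit codeword. It is a teaching sketch only: ECC DIMMs typically use a wider code (commonly 8 check bits per 64 data bits) that also detects double-bit errors, and the function names here are chosen for this example.

```python
def encode(data):
    """Encode 4 data bits as a 7-bit Hamming codeword.

    Codeword positions 1..7 hold: p1, p2, d1, p4, d2, d3, d4.
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def decode(codeword):
    """Return (data_bits, syndrome).

    A syndrome of 0 means no error was seen; otherwise it is the 1-based
    position of the bit that was flipped back during correction.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1  # correct the single flipped bit
    return [c[2], c[4], c[5], c[6]], syndrome
```

Flipping any one bit of `encode([1, 0, 1, 1])` and passing the result to `decode` recovers `[1, 0, 1, 1]` and reports which position was wrong. Two simultaneous flips are silently miscorrected by this toy code, which mirrors the commenters' point that ECC lowers the error rate without eliminating it.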

Mitigations and Open Questions

  • Suggestions:
    • Run analysis locally and inform users when memory appears flaky.
    • Map out bad RAM regions in the OS.
    • Add redundancy/checksums for critical in-memory data.
  • Some argue engineering around bad hardware isn’t worthwhile except in safety‑critical systems; others say robustness to hardware faults is increasingly important.
  • Several commenters express interest in comparable data from Chrome and in a proper, detailed write‑up of Firefox’s methodology.
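
The redundancy/checksum suggestion above could look roughly like the following: a small wrapper stores a CRC32 next to a critical buffer and verifies it on every read. The class name `GuardedBytes` and its interface are hypothetical, and nothing in the thread says Firefox does this; the sketch only shows the general shape of the idea using Python's standard `zlib.crc32`.

```python
import zlib


class GuardedBytes:
    """Hold a byte payload together with a CRC32 taken at write time."""

    def __init__(self, payload: bytes):
        self._payload = bytearray(payload)
        self._crc = zlib.crc32(self._payload)

    def read(self) -> bytes:
        """Return the payload, or raise if it no longer matches its checksum."""
        if zlib.crc32(self._payload) != self._crc:
            raise RuntimeError("checksum mismatch: payload changed in memory")
        return bytes(self._payload)
```

A caller would wrap a critical value once, for example `guarded = GuardedBytes(b"session-key")`, and call `guarded.read()` wherever the value is needed. The check detects corruption but cannot repair it, and it adds a cost to every read, which is the trade-off behind the disagreement over whether such hardening is worthwhile outside safety-critical systems.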