10% of Firefox crashes are caused by bitflips
How Firefox Is Attributing Crashes to Bitflips
- Firefox added a post-crash memory tester that runs on user machines; code is public (Rust runner + separate memtest crate).
- Described techniques include:
  - Writing known bit patterns to RAM and reading back to detect flips.
  - Using “magic” sentinel values in data structures and checking whether they differ by only one or a few bits.
- Reported measurement: ~5% of crashes flagged as “potentially” due to bad/flaky memory; author then extrapolates up to ~10–15% with a “conservative heuristic,” which is not fully explained.
- Several commenters note that “potential” and the missing details make the true rate unclear.
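
The two techniques listed above can be sketched as follows. This is a minimal illustration, not Firefox's actual code: the function names, the `max_flips` threshold, and the pattern set are all assumptions made for the example, and a Python buffer sits behind caches and an interpreter, so a real tester would need lower-level access to memory.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two non-negative integers differ."""
    return bin(a ^ b).count("1")


def classify_sentinel(observed: int, expected: int, max_flips: int = 2) -> str:
    """Compare a sentinel read from a data structure with its known value.

    A difference of only a few bits is more consistent with a hardware
    bitflip than with a software overwrite, which usually changes many bits.
    max_flips is a hypothetical threshold chosen for this sketch.
    """
    distance = hamming_distance(observed, expected)
    if distance == 0:
        return "intact"
    if distance <= max_flips:
        return "possible bitflip"
    return "overwritten"


def pattern_test(buf: bytearray, patterns=(0x00, 0xFF, 0xAA, 0x55)) -> list:
    """Fill buf with each byte pattern, read it back, and collect mismatches.

    Returns a list of (offset, written, observed) tuples.
    """
    errors = []
    for pattern in patterns:
        for i in range(len(buf)):
            buf[i] = pattern
        for i in range(len(buf)):
            if buf[i] != pattern:
                errors.append((i, pattern, buf[i]))
    return errors
```

A sentinel such as `0xDEADBEEF` read back with a single bit changed would be classified as a possible bitflip, while a value of zero would be classified as overwritten. The classification is only a heuristic, which is part of why commenters stress the word “potentially.”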
Skepticism About the 10–15% Claim
- Some find 10% of crashes from hardware defects “huge” and hard to believe, suspecting biased telemetry (e.g., small number of very bad machines).
- Others criticize the extrapolation from 5% to 10% as unsupported handwaving.
- Some worry that rare races, allocator or kernel bugs, or Firefox-specific issues could be misclassified as hardware faults.
- Counter‑argument: large-scale crash triage in other systems (OSes, games, Go toolchain) also reveals a nontrivial tail of crashes best explained by memory or CPU faults.
User Reports and Comparative Behavior
- Mixed experiences: some users see Firefox crash frequently (often on exit or under high tab count), others report near-zero crashes over years.
- Multiple anecdotes of Firefox being the first app to fail on machines later diagnosed with bad RAM or misconfigured/overclocked memory.
- Others claim Chromium-based browsers crash less on the same hardware, suggesting Firefox might simply be buggier or more memory-hungry.
- It’s noted that crashes are concentrated on faulty machines, so “10% of crashes” does not mean 10% of users are impacted.
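
The last point, that a share of crashes is not a share of users, is easy to check with a little arithmetic. The numbers below are invented purely for illustration and do not come from Firefox telemetry.

```python
def crash_and_user_shares(faulty_users, crashes_per_faulty,
                          healthy_users, crashes_per_healthy):
    """Return (share of crashes from faulty machines, share of users on them)."""
    faulty_crashes = faulty_users * crashes_per_faulty
    healthy_crashes = healthy_users * crashes_per_healthy
    crash_share = faulty_crashes / (faulty_crashes + healthy_crashes)
    user_share = faulty_users / (faulty_users + healthy_users)
    return crash_share, user_share


# Hypothetical population: 10 machines with bad RAM crash 11 times each,
# 990 healthy machines crash once each.
crash_share, user_share = crash_and_user_shares(10, 11, 990, 1)
print(f"{crash_share:.0%} of crashes come from {user_share:.0%} of users")
```

With these made-up figures, 110 of 1,100 crashes (10%) come from 10 of 1,000 users (1%), so a small number of very bad machines can dominate the crash statistics.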
Hardware, ECC, and Bitflip Context
- Commenters emphasize that bitflips can arise from marginal RAM, heat, aging, PSU issues, or misconfiguration, not only cosmic rays.
- ECC RAM and CPU cache ECC significantly reduce or surface errors but don’t eliminate them; many consumer systems lack full ECC support.
- DDR5’s on-die “ECC” is distinguished from system-wide ECC; seen as improving yield/error rates but not equivalent to traditional ECC DIMMs.
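
To illustrate why ECC reduces errors without eliminating them, here is a toy Hamming(7,4) code. It is a teaching sketch only: ECC DIMMs typically protect 64 data bits with 8 check bits using a single-error-correct, double-error-detect code, and the helper names below are made up for this example.

```python
DATA_POSITIONS = (3, 5, 6, 7)    # 1-based positions holding data bits
PARITY_POSITIONS = (1, 2, 4)     # 1-based positions holding parity bits


def syndrome(code):
    """XOR of the 1-based positions of all set bits; 0 for a valid codeword."""
    s = 0
    for pos, bit in enumerate(code, start=1):
        if bit:
            s ^= pos
    return s


def encode(data):
    """Encode 4 data bits (list of 0/1) into a 7-bit Hamming codeword."""
    code = [0] * 7
    for pos, bit in zip(DATA_POSITIONS, data):
        code[pos - 1] = bit
    s = syndrome(code)
    for p in PARITY_POSITIONS:
        code[p - 1] = 1 if s & p else 0   # choose parity so the syndrome is 0
    return code


def decode(code):
    """Correct at most one flipped bit; return (data bits, syndrome seen)."""
    code = list(code)
    s = syndrome(code)
    if s:                       # a nonzero syndrome names the flipped position
        code[s - 1] ^= 1
    return [code[pos - 1] for pos in DATA_POSITIONS], s
```

Any single flipped bit in a codeword is located and repaired by `decode`. Two flips in the same word produce a syndrome that points at a third, innocent bit, so this small code silently returns wrong data; real SECDED schemes add an extra parity bit to detect that case, but still cannot correct it.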
Mitigations and Open Questions
- Suggestions:
  - Run analysis locally and inform users when memory appears flaky.
  - Map out bad RAM regions in the OS.
  - Add redundancy/checksums for critical in-memory data.
- Some argue engineering around bad hardware isn’t worthwhile except in safety‑critical systems; others say robustness to hardware faults is increasingly important.
- Several commenters express interest in comparable data from Chrome and in a proper, detailed write‑up of Firefox’s methodology.
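
The redundancy/checksum suggestion above could look something like the following. This is a hedged sketch, not anything Firefox is known to do: `GuardedBlob` and `CorruptionError` are hypothetical names, and CRC-32 is just one convenient choice of checksum.

```python
import zlib


class CorruptionError(RuntimeError):
    """Raised when stored data no longer matches its checksum."""


class GuardedBlob:
    """Keeps a CRC-32 next to a payload and verifies it on every read."""

    def __init__(self, payload: bytes):
        self._buf = bytearray(payload)
        self._crc = zlib.crc32(self._buf)

    def read(self) -> bytes:
        # Recompute the checksum; a mismatch means the bytes (or the stored
        # CRC itself) changed without going through write().
        if zlib.crc32(self._buf) != self._crc:
            raise CorruptionError("checksum mismatch: possible memory corruption")
        return bytes(self._buf)

    def write(self, payload: bytes) -> None:
        self._buf = bytearray(payload)
        self._crc = zlib.crc32(self._buf)
```

A checksum like this only detects corruption; it cannot tell a hardware bitflip from a stray software write, and it cannot repair the data. The cost is a full pass over the payload on each read, which is the trade-off behind the disagreement about whether such hardening is worthwhile outside safety‑critical systems.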