Cloudflare outage should not have happened
How Critical Is Cloudflare?
- Some argue Cloudflare now resembles critical infrastructure: taking down “lots of websites” at once can plausibly have life-or-death downstream impacts (healthcare, emergency coordination, research, etc.).
- Others counter that this still isn’t comparable to safety‑critical systems like bridges or avionics, and we shouldn’t demand the same level of engineering rigor.
- A middle view: Cloudflare’s core proxy/DDoS stack has become “insulin pump–like” in importance and should trade speed of feature delivery for much higher reliability.
Root Cause vs Blast Radius
- Many commenters think the blog over-attributes the outage to database design; they see the real failure in the deployment model and blast radius:
- A bad config/query was rolled out quickly and globally with no effective staging, rate limiting, or circuit breakers.
- Systems crashed hard (panic/OOM) instead of failing closed, reverting to last-known-good config, or degrading gracefully.
- Suggested mitigations: blue/green or phased rollouts; hard caps and alerts on config churn or output size; production-like integration tests using real backups; chaos/outage simulations; automated rollback as the default response to catastrophic errors.
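Two of the mitigations above, a hard cap on generated config size and falling back to the last-known-good config, can be sketched in a few lines of Rust. This is a minimal illustration under stated assumptions, not Cloudflare's actual code: the names Config, ConfigError, ConfigHolder, parse_config, the one-entry-per-line format, and the MAX_ENTRIES value are all hypothetical.

```rust
/// Hypothetical hard cap on entries in a generated config (illustrative value).
pub const MAX_ENTRIES: usize = 200;

/// Hypothetical parsed config: one entry name per input line.
#[derive(Debug, Clone, PartialEq)]
pub struct Config {
    pub entries: Vec<String>,
}

/// Reasons a candidate config is rejected.
#[derive(Debug, PartialEq)]
pub enum ConfigError {
    TooManyEntries { found: usize, limit: usize },
    EmptyEntry { line: usize },
}

/// Validate and parse a candidate config, returning an error
/// instead of panicking on oversized or malformed input.
pub fn parse_config(raw: &str) -> Result<Config, ConfigError> {
    let found = raw.lines().count();
    if found > MAX_ENTRIES {
        return Err(ConfigError::TooManyEntries { found, limit: MAX_ENTRIES });
    }
    let mut entries = Vec::with_capacity(found);
    for (idx, line) in raw.lines().enumerate() {
        let name = line.trim();
        if name.is_empty() {
            // Report 1-based line numbers for easier debugging.
            return Err(ConfigError::EmptyEntry { line: idx + 1 });
        }
        entries.push(name.to_string());
    }
    Ok(Config { entries })
}

/// Holds the config currently in use; it is replaced only by a
/// candidate that passed validation.
pub struct ConfigHolder {
    active: Config,
}

impl ConfigHolder {
    pub fn new(initial: Config) -> Self {
        Self { active: initial }
    }

    pub fn active(&self) -> &Config {
        &self.active
    }

    /// Try to apply a new config. On failure the last-known-good
    /// config stays active and the error is returned for alerting.
    pub fn try_update(&mut self, raw: &str) -> Result<(), ConfigError> {
        let candidate = parse_config(raw)?;
        self.active = candidate;
        Ok(())
    }
}
```

With this shape, an oversized or malformed file produces an error the caller can log and alert on, while the service keeps running on the previous config. A phased rollout would sit on top of this: push the candidate to a small fraction of machines first and stop if try_update starts returning errors.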
Database Rigor and Formal Methods
- The article’s prescription (“no NULLs, fully normalized schema, formally verified code”) is widely viewed as idealistic:
- Normalization and constraints are good practice but wouldn’t have guaranteed catching this specific cross-database query bug.
- DISTINCT/LIMIT in the query might have masked the issue instead of fixing it.
- Formal verification is described as extremely costly and only practical for very small, critical surfaces, and still depends on humans specifying the right properties.
Rust, Panics, and unwrap()
- Large subthread on Rust’s unwrap():
- Some say unwrap() in production, especially in config paths, is an obvious anti-pattern that linters or policies should forbid in critical services.
- Others defend unwrap() as just an assertion: acceptable when failure truly is unrecoverable or “should never happen,” with the real issue being upstream design and rollout, not the panic site.
- Proposals include language or tooling support to statically track and ban panics (beyond malloc) across dependencies; critics worry this becomes complex and Java-like.
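To make the two positions in this subthread concrete, here is a small hedged sketch contrasting a panicking unwrap() with explicit error handling. The function names (parse_limit_or_panic, parse_limit, next_limit) and the idea of a numeric limit read from a config value are hypothetical, chosen only for illustration.

```rust
use std::num::ParseIntError;

/// Panicking variant: unwrap() acts as an assertion, so any
/// malformed input panics in the calling thread.
pub fn parse_limit_or_panic(raw: &str) -> u32 {
    raw.trim().parse::<u32>().unwrap()
}

/// Non-panicking variant: the error is returned and the caller
/// decides how to react.
pub fn parse_limit(raw: &str) -> Result<u32, ParseIntError> {
    raw.trim().parse::<u32>()
}

/// Degrade gracefully: keep the previous value when the new one
/// is invalid, and log the rejection to stderr.
pub fn next_limit(previous: u32, raw: &str) -> u32 {
    match parse_limit(raw) {
        Ok(value) => value,
        Err(err) => {
            eprintln!("rejecting limit {:?}: {}; keeping {}", raw, err, previous);
            previous
        }
    }
}
```

Teams that want to enforce the first position mechanically can enable the Clippy restriction lint clippy::unwrap_used, which flags calls to unwrap(). The second position still holds for invariants that genuinely cannot fail, where a panic is the intended assertion.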
Postmortems, Blame, and Centralization
- Debate over “root cause analysis”: some call it misleading for complex, multicausal failures and better replaced with 5‑whys and “Swiss cheese” models.
- Several see the blog as hindsight-heavy “Monday morning quarterbacking”; others see it as a useful prompt to discuss trade-offs.
- A recurring meta-point: Cloudflare’s extreme centralization makes any single mistake disproportionately damaging; some argue the deeper issue is the web’s dependence on a few chokepoints rather than one specific query or language feature.