2025-11-26

Cloudflare outage should not have happened

How Critical Is Cloudflare?

Some argue Cloudflare now resembles critical infrastructure: taking down “lots of websites” at once can plausibly have life-or-death downstream impacts (healthcare, emergency coordination, research, etc.).
Others counter that this still isn’t comparable to safety‑critical systems like bridges or avionics, and we shouldn’t demand the same level of engineering rigor.
A middle view: Cloudflare’s core proxy/DDoS stack has become “insulin pump–like” in importance and should trade speed of feature delivery for much higher reliability.

Root Cause vs Blast Radius

Many commenters think the blog over-attributes the outage to database design; they see the real failure in the deployment model and blast radius:
- A bad config/query was rolled out quickly and globally with no effective staging, rate limiting, or circuit breakers.
- Systems crashed hard (panic/OOM) instead of failing closed, reverting to last-known-good config, or degrading gracefully.
Suggested mitigations: blue/green or phased rollouts; hard caps and alerts on config churn or output size; production-like integration tests using real backups; chaos/outage simulations; automated rollback as the default response to catastrophic errors.
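Two of the mitigations above, a hard cap on output size and reverting to the last-known-good config, can be sketched in a few lines of Rust. This is a minimal illustration, not Cloudflare's code: `MAX_ENTRIES`, `Config`, `ConfigHolder`, and the one-entry-per-line format are all hypothetical names and assumptions chosen for the example.

```rust
/// Hypothetical hard cap on the number of entries in one generated config.
pub const MAX_ENTRIES: usize = 200;

/// Assumed config shape: one non-empty entry per line.
#[derive(Debug, Clone, PartialEq)]
pub struct Config {
    pub entries: Vec<String>,
}

#[derive(Debug, PartialEq)]
pub enum ConfigError {
    Empty,
    TooManyEntries { got: usize, max: usize },
}

/// Validate a candidate config before it may replace the active one.
pub fn validate(raw: &str) -> Result<Config, ConfigError> {
    let entries: Vec<String> = raw
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(String::from)
        .collect();

    if entries.is_empty() {
        return Err(ConfigError::Empty);
    }
    if entries.len() > MAX_ENTRIES {
        // Reject instead of panicking or allocating without bound.
        return Err(ConfigError::TooManyEntries {
            got: entries.len(),
            max: MAX_ENTRIES,
        });
    }
    Ok(Config { entries })
}

/// Holds the last-known-good config and counts rejected updates.
pub struct ConfigHolder {
    active: Config,
    rejected: u64,
}

impl ConfigHolder {
    pub fn new(initial: Config) -> Self {
        ConfigHolder { active: initial, rejected: 0 }
    }

    /// Swap in the new config only if it validates; otherwise keep serving
    /// the previous one and report the error to the caller.
    pub fn apply(&mut self, raw: &str) -> Result<(), ConfigError> {
        match validate(raw) {
            Ok(cfg) => {
                self.active = cfg;
                Ok(())
            }
            Err(err) => {
                self.rejected += 1; // a real service would alert on this
                Err(err)
            }
        }
    }

    pub fn active(&self) -> &Config {
        &self.active
    }

    pub fn rejected(&self) -> u64 {
        self.rejected
    }
}
```

The design choice is that a bad update becomes an error value and a counter increment while the old config stays active, so an oversized file degrades freshness instead of availability. A phased rollout would sit in front of `apply`, feeding new configs to a small share of machines first.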

Database Rigor and Formal Methods

The article’s prescription (“no NULLs, fully normalized schema, formally verified code”) is widely viewed as idealistic:
- Normalization and constraints are good practice but wouldn’t have guaranteed catching this specific cross-database query bug.
- DISTINCT/LIMIT in the query might have masked the issue instead of fixing it.
- Formal verification is described as extremely costly, practical only for very small critical surfaces, and still dependent on humans specifying the right properties.

Rust, Panics, and unwrap()

Large subthread on Rust’s unwrap():
- Some say unwrap() in production—especially in config paths—is an obvious anti-pattern that linters or policies should forbid in critical services.
- Others defend unwrap() as just an assertion: acceptable when failure truly is unrecoverable or “should never happen,” with the real issue being upstream design and rollout, not the panic site.
- Proposals include language or tooling support to statically track and ban panics (beyond malloc) across dependencies; critics worry this becomes complex and Java-like.
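The two positions in this subthread can be made concrete with a small sketch. It assumes a hypothetical config line of the form `max_features=200`; the function names and `LimitError` are illustrative only. The `clippy::unwrap_used` and `clippy::expect_used` lints are existing Clippy restriction lints and only take effect when Clippy is run; plain `rustc` ignores them.

```rust
#[derive(Debug, PartialEq)]
pub enum LimitError {
    Missing,
    NotANumber(String),
}

/// The "assertion" style: any malformed line panics at the call site.
pub fn parse_limit_or_panic(line: &str) -> usize {
    line.split_once('=').unwrap().1.trim().parse().unwrap()
}

/// The policy style: unwrap()/expect() are denied for this function when
/// Clippy runs, so every failure must be returned as a value.
#[deny(clippy::unwrap_used, clippy::expect_used)]
pub fn parse_limit(line: &str) -> Result<usize, LimitError> {
    let (_, value) = line.split_once('=').ok_or(LimitError::Missing)?;
    let value = value.trim();
    value
        .parse::<usize>()
        .map_err(|_| LimitError::NotANumber(value.to_string()))
}

/// The caller decides what a failure means; here it keeps the previous limit.
pub fn limit_or_previous(line: &str, previous: usize) -> usize {
    parse_limit(line).unwrap_or(previous)
}
```

Both versions agree on well-formed input. The difference is who decides what happens on bad input: the panic site, or a caller that may have a previous value to fall back to.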

Postmortems, Blame, and Centralization

Debate over “root cause analysis”: some call it misleading for complex, multicausal failures and better replaced with 5‑whys and “Swiss cheese” models.
Several see the blog as hindsight-heavy “Monday morning quarterbacking”; others see it as a useful prompt to discuss trade-offs.
A recurring meta-point: Cloudflare’s extreme centralization makes any single mistake disproportionately damaging; some argue the deeper issue is the web’s dependence on a few chokepoints rather than one specific query or language feature.
