Cloudflare outage on December 5, 2025

Root Cause and Language Debates

  • Outage traced to a long‑standing Lua bug in an old proxy (FL1): a rule with the “execute” action assumed a nested object always existed; when a killswitch caused the rule to be skipped, the nested object was nil and the Lua code crashed.
  • Many note this is very similar in shape to the recent Rust unwrap() incident: code implicitly assumes success and then fails closed.
  • Disagreement over Cloudflare’s claim that “strong type systems” would prevent this:
    • Some argue Rust/strong typing can encode these invariants and make such bugs compile‑time errors.
    • Others point out Cloudflare just had a nearly identical failure in Rust because they opted into panicking APIs (unwrap), so language choice isn’t sufficient without discipline, linting, and review (see the sketch below).
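
To make that debate concrete, here is a minimal Rust sketch. The Rule type and field names are invented for illustration, not Cloudflare's actual FL1 or FL2 code. It shows the same shape of bug in both styles: an Option forces the missing-value case to be handled explicitly, while unwrap() opts back into a process-killing panic, much like the Lua nil dereference.

```rust
// Hypothetical rule shape, for illustration only.
#[derive(Debug)]
struct Rule {
    action: String,
    // The nested object may legitimately be absent, e.g. when a killswitch
    // has disabled whatever this rule points at.
    execute_target: Option<String>,
}

// Typed style: the compiler forces the None case to be spelled out.
fn apply_rule_checked(rule: &Rule) -> Result<String, String> {
    match rule.execute_target.as_deref() {
        Some(target) => Ok(format!("executing {target}")),
        None => Err(format!("{} rule has no target; skipping", rule.action)),
    }
}

// Opting out of the safety net: unwrap() turns the same missing value
// into a panic, the moral equivalent of the Lua nil dereference.
#[allow(dead_code)]
fn apply_rule_panicky(rule: &Rule) -> String {
    format!("executing {}", rule.execute_target.as_deref().unwrap())
}

fn main() {
    let rule = Rule { action: "execute".into(), execute_target: None };
    println!("{:?}", apply_rule_checked(&rule)); // Err(...), handled gracefully
    // apply_rule_panicky(&rule);                // would panic and take the worker down
}
```

Both camps in the thread are effectively arguing over which of these two functions a company's defaults, lints, and code review push engineers toward.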

Deployment, Monitoring, and Rollback

  • Central criticism: a global configuration system with no gradual rollout and near‑instant worldwide propagation is inherently high‑blast‑radius (a staged‑rollout sketch follows this list).
  • Timeline (≈25 minutes from bad change to full recovery) sparked debate:
    • Some call a 2‑minute time‑to‑alert “terrible” at this scale; others defend it as consistent with common Prometheus‑style scrape intervals and alert denoising.
  • Many feel Cloudflare should have rolled back the first change immediately once internal errors appeared, rather than issuing a second, global “killswitch” change through the same risky channel.
  • Questions raised about how well deployment teams can correlate config pushes to error spikes and whether on‑call engineers had real‑time visibility and authority to slam the rollback button.
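
As a sketch of the alternative commenters keep asking for, the Rust fragment below outlines a staged rollout loop with an error-rate gate and automatic rollback between stages. All names here (push_config, error_rate_for, the stage percentages and bake times) are invented placeholders, not Cloudflare's actual deployment tooling.

```rust
use std::thread::sleep;
use std::time::Duration;

// One step of a progressive rollout: what fraction of the fleet gets the
// change, and how long to watch it before widening the blast radius.
#[derive(Clone, Copy)]
struct Stage {
    percent_of_fleet: u8,
    bake_time_secs: u64, // kept tiny for the sketch; real bake times would be minutes
}

const STAGES: &[Stage] = &[
    Stage { percent_of_fleet: 1, bake_time_secs: 2 },
    Stage { percent_of_fleet: 5, bake_time_secs: 2 },
    Stage { percent_of_fleet: 25, bake_time_secs: 2 },
    Stage { percent_of_fleet: 100, bake_time_secs: 0 },
];

// Placeholder hooks into deployment and monitoring; signatures are hypothetical.
fn push_config(change_id: &str, percent: u8) { let _ = (change_id, percent); }
fn rollback_config(change_id: &str) { let _ = change_id; }
fn error_rate_for(change_id: &str) -> f64 { let _ = change_id; 0.0 }

fn staged_rollout(change_id: &str, max_error_rate: f64) -> Result<(), String> {
    for stage in STAGES {
        push_config(change_id, stage.percent_of_fleet);
        sleep(Duration::from_secs(stage.bake_time_secs));

        // Gate on observed errors before going wider; roll back automatically
        // instead of pushing a second global change through the same channel.
        let observed = error_rate_for(change_id);
        if observed > max_error_rate {
            rollback_config(change_id);
            return Err(format!(
                "rolled back at {}% of fleet: error rate {observed:.4} > {max_error_rate:.4}",
                stage.percent_of_fleet
            ));
        }
    }
    Ok(())
}

fn main() {
    match staged_rollout("waf-rule-change", 0.001) {
        Ok(()) => println!("rollout completed"),
        Err(reason) => eprintln!("{reason}"),
    }
}
```

The point of the sketch is the shape of the control loop, not the thresholds: a change is watched against real error telemetry at a small percentage of the fleet before it is allowed anywhere near all of it.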

Testing, Staging, and Tech Debt

  • Commenters are stunned that a “never before used” killswitch+execute path in a critical rules engine had apparently never been unit‑tested or fuzzed.
  • Several argue large‑scale infra is hard to fully simulate, but this particular nil‑dereference was straightforward to catch with basic tests (see the test sketch below).
  • Broader concern that Cloudflare’s rapid product expansion and legacy Lua glue code have accumulated tech debt and knowledge silos faster than quality engineering can keep up.
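
As a rough illustration of how cheap that coverage would be, the test sketch below (a hypothetical ExecuteRule type, not the real FL1 rules engine) constructs an execute rule whose referenced target has been killswitched away and asserts that evaluation degrades gracefully instead of crashing.

```rust
// Hypothetical stand-in for an "execute" rule whose referenced rule set may
// have been disabled by a killswitch; None models that absent nested object.
struct ExecuteRule {
    target: Option<String>,
}

fn evaluate(rule: &ExecuteRule) -> Result<(), String> {
    match &rule.target {
        Some(_target) => Ok(()), // would dispatch into the referenced rule set here
        None => Err("execute target missing; rule skipped".to_string()),
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn execute_rule_with_killswitched_target_does_not_crash() {
        let rule = ExecuteRule { target: None };
        // The assertion that matters: this path returns an error
        // instead of panicking the worker.
        assert!(evaluate(&rule).is_err());
    }
}
```

Run with cargo test; a property-based or fuzz test over rule configurations would cover the same path without enumerating killswitch combinations by hand.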

Security vs. Availability Tradeoff (React CVE)

  • Cloudflare was rolling out WAF changes to mitigate a serious React Server RCE.
  • Some defend the urgency: every hour of delay risks active exploitation; quick, global WAF updates are part of Cloudflare’s value.
  • Others note this CVE wasn’t a same‑day zero‑day and say a “rushed security fix” doesn’t justify skipping progressive rollout and ignoring early warning signals.

Critical Infrastructure and Centralization

  • Strong disagreement over impact:
    • Some downplay the impact, viewing 30 minutes of downtime as acceptable.
    • Many insist Cloudflare is now de facto critical infrastructure (including healthcare, finance, and safety‑related systems), so such outages are unacceptable.
  • Recurrent worry about monoculture: one company’s mistake knocking out a large fraction of the web contradicts the internet’s original decentralized, resilient design.
  • This fuels calls for multi‑CDN setups, more self‑hosting, or at least smaller blast radii inside Cloudflare.

Perception of Culture and Reliability

  • Multiple comments describe Cloudflare’s approach as “move fast” or “cowboy” ops: continuing risky rollouts despite a recent similar outage and known deficiencies in the config system.
  • Others praise Cloudflare’s unusually detailed, transparent postmortems but complain that transparency is starting to feel like a substitute for fixing systemic deployment and testing issues.
  • Some ask explicitly whether internal incentives, cost‑cutting, AI‑generated code, or leadership changes are weakening operational rigor; others caution these are speculative and unproven from the outside.