Cloudflare outage on November 18, 2025 post mortem
Incident mechanics and scope
- A ClickHouse permission change made a metadata query (`system.columns` without a database filter) start returning duplicate columns from an additional schema.
- That doubled the Bot Management “feature file” used by Cloudflare’s new FL2 proxy; the file now exceeded a hard 200-feature limit.
- The FL2 bot module hit that limit and returned an error, and the calling code used `unwrap()` on the `Result`, panicking and crashing the worker thread (see the sketch after this list).
- The oversized config was refreshed and pushed globally every few minutes, so the “poison pill” propagated quickly and repeatedly.
- Old FL proxies failed in a “softer” way (all traffic got bot score 0) while FL2 crashed and returned massive volumes of 5xx errors.
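A minimal Rust sketch of that failure chain: the function and type names are illustrative stand-ins, and the query text in the comment is only approximate; what comes from the write-up is the hard 200-feature limit and the `unwrap()` on the error path.

```rust
// Illustrative sketch only: names and types are assumptions; the 200-feature
// limit and the unwrap() on the error path are from the postmortem.
//
// The metadata query behind the feature file had no database filter, roughly:
//     SELECT name, type FROM system.columns WHERE table = 'http_requests_features';
// so the new permission grant made it return every column twice.

const FEATURE_LIMIT: usize = 200; // hard cap on Bot Management features

struct FeatureConfig {
    names: Vec<String>,
}

fn load_features(rows: Vec<String>) -> Result<FeatureConfig, String> {
    // With the duplicate schema visible, `rows` is roughly twice its normal
    // size, so this limit check suddenly fails on every machine.
    if rows.len() > FEATURE_LIMIT {
        return Err(format!(
            "feature file has {} entries, limit is {}",
            rows.len(),
            FEATURE_LIMIT
        ));
    }
    Ok(FeatureConfig { names: rows })
}

fn on_config_refresh(rows: Vec<String>) {
    // The caller treated "too many features" as impossible: unwrap() turns
    // the Err into a panic that takes down the worker thread serving traffic.
    let config = load_features(rows).unwrap();
    println!("loaded {} features", config.names.len());
}

fn main() {
    on_config_refresh(vec!["feature".to_string(); 150]); // normal file loads fine
    on_config_refresh(vec!["feature".to_string(); 300]); // doubled file panics here
}
```

The point the thread keeps returning to is the last step: the limit check itself is reasonable, but `unwrap()` converts a recoverable bad-input error into a process-killing panic.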
Testing, staging, and rollout
- Many commenters argue the failure should have been caught in staging or CI by:
  - Realistic data-volume tests or synthetic “20x data” tests.
  - Golden-result tests for key DB queries before and after permission changes.
  - Validating the generated feature file (size, duplicates, schema) and test-loading it into a proxy before global rollout (a minimal validation sketch follows this list).
- Others note that duplicating Cloudflare’s production scale for staging is extremely expensive, but the counter-arguments are that:
  - You don’t need full scale for every commit; periodic large-scale tests and strong canarying would help.
  - Config changes that can take down the fleet should have progressive, ring-based rollouts and auto-rollback, not “push everywhere every 5 minutes”.
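One concrete version of the validation idea, as a hedged sketch: the limit, the duplicate check, the function name, and the sample feature names are assumptions, not a description of Cloudflare’s pipeline.

```rust
use std::collections::HashSet;

const FEATURE_LIMIT: usize = 200; // must match the proxy's runtime limit

/// Hypothetical CI / pre-publish gate for a generated feature file:
/// reject the artifact before any proxy ever loads it.
fn validate_feature_file(names: &[String]) -> Result<(), String> {
    if names.len() > FEATURE_LIMIT {
        return Err(format!(
            "generated file has {} features, runtime limit is {}",
            names.len(),
            FEATURE_LIMIT
        ));
    }
    let mut seen = HashSet::new();
    for name in names {
        // Duplicate columns leaking in from a second schema would fail here.
        if !seen.insert(name.as_str()) {
            return Err(format!("duplicate feature: {name}"));
        }
    }
    Ok(())
}

fn main() {
    let doubled: Vec<String> = ["bot_score_ml", "bot_score_heuristic", "bot_score_ml"]
        .iter()
        .map(|s| s.to_string())
        .collect();
    println!("{:?}", validate_feature_file(&doubled)); // Err(duplicate feature: ...)
}
```

Run against every generated artifact, and again on a small canary ring that actually loads the file, this is the kind of cheap check commenters contrast with the cost of a full-scale staging environment.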
Rust, `unwrap()`, and error handling
- Large subthread around whether using `unwrap()` in critical Rust code is acceptable.
  - Critics: in production, `unwrap()` is equivalent to an unguarded panic, hides invariants that should be expressed as `Result` handling, and should be linted or banned.
  - Defenders: the real problem is the violated invariant and the lack of higher-level handling; replacing `unwrap()` with `return Err(...)` would still have yielded 5xxs without better design (both paths are sketched after this list).
- Broader debate compares Rust’s `Result`-style errors vs. exceptions, checked vs. unchecked, and how easy it is in all languages to paper over error paths.
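To make the two positions concrete, a small sketch with entirely hypothetical names: propagating the error instead of `unwrap()`-ing it only helps if some caller decides what “degraded” looks like, for example keeping the last-good config.

```rust
struct FeatureConfig {
    names: Vec<String>,
}

fn load_features(rows: Vec<String>) -> Result<FeatureConfig, String> {
    if rows.len() > 200 {
        return Err(format!("{} features exceeds the 200-feature limit", rows.len()));
    }
    Ok(FeatureConfig { names: rows })
}

struct BotModule {
    current: FeatureConfig, // last-known-good configuration
}

impl BotModule {
    // The pattern critics object to: a bad config file panics the worker.
    #[allow(dead_code)]
    fn refresh_unwrap(&mut self, rows: Vec<String>) {
        self.current = load_features(rows).unwrap();
    }

    // Returning the Err merely moves the failure up a level; the design
    // question defenders raise is what the caller does with it. Here:
    // log, alert, and keep serving with the last-good configuration.
    fn refresh_keep_last_good(&mut self, rows: Vec<String>) {
        match load_features(rows) {
            Ok(config) => self.current = config,
            Err(e) => eprintln!("rejected feature file, keeping previous config: {e}"),
        }
    }
}

fn main() {
    let mut module = BotModule {
        current: load_features(vec!["feature".to_string(); 150]).expect("baseline config"),
    };
    // A doubled file is rejected, but the module keeps its 150 features.
    module.refresh_keep_last_good(vec!["feature".to_string(); 300]);
    println!("still serving with {} features", module.current.names.len());
}
```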
Architecture, blast radius, and failure modes
- Many point out this was not “just a bug” but an architectural issue:
  - A non-core feature (bot scoring) was able to crash the core proxy.
  - The system “fail-crashed” instead of failing open or keeping the last-good config.
- Suggestions:
  - Treat rapid, global config as dangerous code: canaries, fault isolation (“cells”/regions), carefully handled global kill switches, and strong observability on panics and config ingestion.
  - Ensure panics in modules are survivable by supervisors or by falling back to previous configs, with clear alerts (a supervisor sketch follows this list).
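A sketch of the “survivable panic” idea, assuming a non-core bot-scoring module and a neutral fallback score; it uses `std::panic::catch_unwind` as the supervisor, which is one of several ways to isolate the fault.

```rust
use std::panic;

// Stand-in for a non-core module that panics on a poisoned config push.
fn bot_score(request_id: u64) -> u8 {
    if request_id % 1_000 == 0 {
        panic!("feature file exceeded limit");
    }
    42
}

// Core request path: a panic in the bot module degrades that one feature
// (fail-open with an "unscored" fallback) instead of becoming a 5xx.
// (The default panic hook still prints its message to stderr.)
fn handle_request(request_id: u64) -> u8 {
    panic::catch_unwind(move || bot_score(request_id)).unwrap_or_else(|_| {
        eprintln!("bot module panicked; serving request {request_id} unscored");
        0 // hypothetical neutral score; the real fallback policy is a product decision
    })
}

fn main() {
    for id in 999..=1_001 {
        println!("request {id} -> score {}", handle_request(id));
    }
}
```

The trade-off commenters flag: failing open on bot scoring keeps traffic flowing, but it also means attacks go unscored while the fallback is active, so the fallback needs to be loud and visible, not silent.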
Operational response and transparency
- Some are impressed by how fast and detailed the public postmortem appeared, including code snippets and a candid incident timeline.
- Others focus on the ~3 hours it took to identify the feature file as the root cause, questioning:
  - Why massive new panics in FL2 weren’t an immediate, high-signal alert (see the panic-hook sketch after this list).
  - Why “it’s a DDoS” was the dominant hypothesis for so long.
- The separate outage of the third-party status page further biased engineers toward believing it was an attack.
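On the observability point, one cheap way to make panics an immediate, high-signal alert is a process-wide panic hook feeding a counter that monitoring scrapes. A sketch under that assumption (the metric export and paging wiring are omitted and hypothetical):

```rust
use std::panic;
use std::sync::atomic::{AtomicU64, Ordering};

static PANIC_COUNT: AtomicU64 = AtomicU64::new(0);

// Hypothetical wiring: count every panic so that a sudden fleet-wide jump
// ("new panics in FL2") can page someone within minutes, not hours.
fn install_panic_metrics() {
    let default_hook = panic::take_hook();
    panic::set_hook(Box::new(move |info| {
        PANIC_COUNT.fetch_add(1, Ordering::Relaxed);
        default_hook(info); // keep the usual message and backtrace behaviour
    }));
}

fn main() {
    install_panic_metrics();
    // Simulate a module panicking on a bad config push.
    let _ = panic::catch_unwind(|| {
        panic!("feature file exceeded limit");
    });
    println!("panics observed: {}", PANIC_COUNT.load(Ordering::Relaxed));
}
```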
Centralization and systemic risk
- Extensive reflection on how much of the internet now depends on a few providers (Cloudflare, AWS, etc.), drawing analogies to historic telco and infrastructure outages.
- Some users report practical impact (unable to manage DNS, log into services) and reconsider reliance on a single CDN/DNS provider.
- A minority argues for regulation and liability around critical internet infrastructure; others counter that outages are inevitable in complex systems and that learning from failures is the path to resilience.