Cloudflare outage on November 18, 2025: post-mortem

Incident mechanics and scope

  • A ClickHouse permissions change caused a metadata query (a SELECT against system.columns with no database filter) to start returning duplicate column rows from an additional schema.
  • That roughly doubled the Bot Management “feature file” consumed by Cloudflare’s new FL2 proxy, pushing it past a hard 200-feature limit.
  • The FL2 bot module hit that limit and returned an error, but the calling code called unwrap() on the Result, turning the error into a panic that crashed the worker thread (see the sketch after this list).
  • The oversized config was refreshed and pushed globally every few minutes, so the “poison pill” propagated quickly and repeatedly.
  • Older FL proxies failed more softly (all traffic got a bot score of 0), while FL2 crashed and returned large volumes of 5xx errors.
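
A minimal Rust sketch of that failure shape, with hypothetical names and a made-up file format (this is not Cloudflare’s actual FL2 code): the loader correctly reports the over-limit file as an Err, but the caller unwraps it, so the error becomes a panic that kills the worker thread.

```rust
// Hypothetical sketch of the failure shape; names, file format, and the limit
// are illustrative, not Cloudflare's actual FL2 code.
const MAX_FEATURES: usize = 200;

#[derive(Debug)]
struct Features(Vec<String>);

fn load_features(raw: &str) -> Result<Features, String> {
    let items: Vec<String> = raw.lines().map(str::to_owned).collect();
    if items.len() > MAX_FEATURES {
        // The loader reports the oversized file correctly as an error...
        return Err(format!(
            "{} features exceed the limit of {}",
            items.len(),
            MAX_FEATURES
        ));
    }
    Ok(Features(items))
}

fn main() {
    // Simulate the duplicated query output: every feature appears twice,
    // so 150 real features become 300 lines.
    let oversized = (0..150)
        .flat_map(|i| [format!("feature_{i}"), format!("feature_{i}")])
        .collect::<Vec<_>>()
        .join("\n");

    // ...but the caller turns that Err into a panic that kills the worker thread.
    let _features = load_features(&oversized).unwrap();
}
```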

Testing, staging, and rollout

  • Many commenters argue the failure should have been caught in staging or CI by:
    • Realistic data-volume tests or synthetic “20x data” tests.
    • Golden-result tests for key DB queries before and after permission changes.
    • Validating the generated feature file (size, duplicate entries, schema) and test-loading it into a proxy before any global rollout (a sketch of such a check follows this list).
  • Others note that duplicating Cloudflare’s production scale in staging is extremely expensive; the counter-arguments are that:
    • You don’t need full scale for every commit; periodic large-scale tests and strong canarying would help.
    • Config changes that can take down the fleet should have progressive, ring-based rollouts and auto-rollback, not “push everywhere every 5 minutes”.
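
One concrete reading of the “validate the generated artifact before pushing it” suggestion, as a hedged sketch; the 200-feature limit, file format, and error names are assumptions rather than anything from Cloudflare’s pipeline.

```rust
use std::collections::HashSet;

// Illustrative pre-rollout gate; the limit and the line-per-feature format
// are assumptions, not Cloudflare's real pipeline.
const MAX_FEATURES: usize = 200;

#[derive(Debug)]
enum ValidationError {
    TooManyFeatures { count: usize },
    DuplicateFeature(String),
}

fn validate_feature_file(raw: &str) -> Result<(), ValidationError> {
    let names: Vec<&str> = raw.lines().filter(|l| !l.trim().is_empty()).collect();
    if names.len() > MAX_FEATURES {
        return Err(ValidationError::TooManyFeatures { count: names.len() });
    }
    let mut seen = HashSet::new();
    for name in names {
        if !seen.insert(name) {
            return Err(ValidationError::DuplicateFeature(name.to_owned()));
        }
    }
    Ok(())
}

fn main() {
    // A generated file with an accidental duplicate should never reach a canary.
    let generated = "bot_score_ua\nbot_score_ua\nja3_fingerprint";
    match validate_feature_file(generated) {
        Ok(()) => println!("feature file looks sane; hand it to a canary"),
        Err(e) => eprintln!("refusing to roll out: {e:?}"),
    }
}
```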

Rust, unwrap(), and error handling

  • A large subthread debates whether using unwrap() in critical Rust code is acceptable.
    • Critics: in production code, unwrap() is an unguarded panic; it hides invariants that should be handled through the Result, and it should be linted against or banned.
    • Defenders: the real problem is the violated invariant and the lack of higher-level handling; replacing unwrap() with return Err(...) would still have produced 5xxs without a better design (both positions are sketched after this list).
  • Broader debate compares Rust’s Result-style errors vs exceptions, checked vs unchecked, and how easy it is in all languages to paper over error paths.
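
Both positions are easy to show side by side in a hypothetical sketch (the types and function names are invented): banning bare unwrap() is mechanical, for example via the clippy::unwrap_used lint, but propagating the Err only helps if some caller does something safer with it than crashing, such as keeping the last good config.

```rust
// Sketch of both positions; the types and functions are hypothetical.
// #![deny(clippy::unwrap_used)]  // how critics would ban bare unwrap() in critical crates

#[derive(Debug, Clone)]
struct BotConfig {
    features: Vec<String>,
}

fn parse_config(raw: &str) -> Result<BotConfig, String> {
    let features: Vec<String> = raw.lines().map(str::to_owned).collect();
    if features.len() > 200 {
        return Err(format!("too many features: {}", features.len()));
    }
    Ok(BotConfig { features })
}

// Critics' target: the Err becomes a panic and the worker thread dies.
#[allow(dead_code)]
fn reload_with_unwrap(raw: &str) -> BotConfig {
    parse_config(raw).unwrap()
}

// Defenders' point: returning Err only helps if some caller decides what
// failure means; here it means keeping the last good config.
fn reload_with_fallback(raw: &str, last_good: &BotConfig) -> BotConfig {
    match parse_config(raw) {
        Ok(fresh) => fresh,
        Err(e) => {
            eprintln!("config rejected ({e}); keeping last good config");
            last_good.clone()
        }
    }
}

fn main() {
    let last_good = BotConfig { features: vec!["ja3_fingerprint".into()] };
    let poisoned = (0..300).map(|i| format!("f{i}")).collect::<Vec<_>>().join("\n");

    // reload_with_unwrap(&poisoned) would panic here; the fallback keeps serving.
    let active = reload_with_fallback(&poisoned, &last_good);
    println!("serving with {} features", active.features.len());
}
```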

Architecture, blast radius, and fail modes

  • Many point out this was not “just a bug” but an architectural issue:
    • A non-core feature (bot scoring) was able to crash the core proxy.
    • The system fail-crashed instead of failing open or falling back to the last known-good config.
  • Suggestions:
    • Treat rapid, global config pushes as dangerously as code deploys: canaries, fault isolation (“cells”/regions), carefully designed global kill switches, and strong observability on panics and config ingestion.
    • Ensure panics in non-core modules are survivable, via supervisors or by falling back to the previous config, with clear alerts (a sketch follows this list).
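
A hedged sketch of the “panics in a non-core module should be survivable” idea, using std::panic::catch_unwind and invented names (not Cloudflare’s architecture): the refresh path contains the panic, keeps serving with the previous configuration, and still surfaces the failure.

```rust
use std::panic::{self, AssertUnwindSafe};

// Illustrative only, not Cloudflare's architecture: a panic in a non-core
// module degrades that module instead of taking down the proxy.
#[derive(Debug, Clone)]
struct BotConfig {
    features: Vec<String>,
}

fn refresh_bot_config(raw: &str) -> BotConfig {
    // Stand-in for less-trusted module code that may panic on bad input.
    assert!(raw.lines().count() <= 200, "feature limit exceeded");
    BotConfig { features: raw.lines().map(str::to_owned).collect() }
}

fn refresh_or_keep_last_good(raw: &str, last_good: BotConfig) -> BotConfig {
    match panic::catch_unwind(AssertUnwindSafe(|| refresh_bot_config(raw))) {
        Ok(fresh) => fresh,
        Err(_) => {
            // Surface the failure loudly, but keep handling traffic.
            eprintln!("ALERT: bot config refresh panicked; serving last good config");
            last_good
        }
    }
}

fn main() {
    let last_good = BotConfig { features: vec!["ja3_fingerprint".into()] };
    let poisoned = (0..400).map(|i| format!("f{i}")).collect::<Vec<_>>().join("\n");

    let active = refresh_or_keep_last_good(&poisoned, last_good);
    println!("still serving with {} features", active.features.len());
}
```

catch_unwind is only one shape of the same idea; a supervisor process that restarts the module, or keeping the last-good file on disk, would serve the same goal of bounding the blast radius.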

Operational response and transparency

  • Some are impressed by how quickly the detailed public post-mortem appeared, including code snippets and a candid incident timeline.
  • Others focus on the roughly three hours it took to identify the feature file as the root cause, questioning:
    • Why a flood of new panics in FL2 wasn’t an immediate, high-signal alert (a sketch of such a signal follows this list).
    • Why “it’s a DDoS” was the dominant hypothesis for so long.
  • The separate outage of the third-party status page further biased engineers toward believing it was an attack.
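
On the alerting question, a hedged sketch of the kind of signal commenters wanted (the metric and how it would be exported are assumptions): a process-wide panic hook that counts panics, so a sudden surge in one module becomes its own alert instead of being inferred from 5xx graphs that also look like a DDoS.

```rust
use std::panic;
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

// Illustrative sketch; the metric and the alerting rule around it are assumptions.
static PANIC_COUNT: AtomicU64 = AtomicU64::new(0);

fn install_panic_metrics() {
    let default_hook = panic::take_hook();
    panic::set_hook(Box::new(move |info| {
        // Count every panic; a metrics exporter and alert rule would watch this
        // counter so a sudden surge is its own signal, not just more 5xxs.
        PANIC_COUNT.fetch_add(1, Ordering::Relaxed);
        default_hook(info);
    }));
}

fn main() {
    install_panic_metrics();

    // Simulate a burst of worker-thread panics, as a crashing module would cause.
    for _ in 0..5 {
        let _ = thread::spawn(|| panic!("feature limit exceeded")).join();
    }

    println!("panics observed: {}", PANIC_COUNT.load(Ordering::Relaxed));
}
```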

Centralization and systemic risk

  • Extensive reflection on how much of the internet now depends on a few providers (Cloudflare, AWS, etc.), drawing analogies to historic telco and infrastructure outages.
  • Some users report practical impact (being unable to manage DNS or log into services) and are reconsidering reliance on a single CDN/DNS provider.
  • A minority argues for regulation and liability around critical internet infrastructure; others counter that outages are inevitable in complex systems and that learning from failures is the path to resilience.