Cloudflare outage on November 18, 2025 post mortem
Incident mechanics and scope
- A ClickHouse permission change made a metadata query (`system.columns` without a database filter) start returning duplicate columns from an additional schema.
- That doubled the Bot Management “feature file” used by Cloudflare’s new FL2 proxy; the file now exceeded a hard 200-feature limit.
- The FL2 bot module hit that limit and returned an error, and the calling code used `unwrap()` on the `Result`, panicking and crashing the worker thread (see the sketch after this list).
- The oversized config was refreshed and pushed globally every few minutes, so the “poison pill” propagated quickly and repeatedly.
- Old FL proxies failed in a “softer” way (all traffic got bot score 0) while FL2 crashed and returned massive volumes of 5xx errors.
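A minimal Rust sketch of that failure chain: the function and type names are illustrative stand-ins, and the query text in the comment is only approximate; what comes from the write-up is the hard 200-feature limit and the `unwrap()` on the error path.

```rust
// Illustrative sketch only: names and types are assumptions; the 200-feature
// limit and the unwrap() on the error path are from the postmortem.
//
// The metadata query behind the feature file had no database filter, roughly:
//     SELECT name, type FROM system.columns WHERE table = 'http_requests_features';
// so the new permission grant made it return every column twice.

const FEATURE_LIMIT: usize = 200; // hard cap on Bot Management features

struct FeatureConfig {
    names: Vec<String>,
}

fn load_features(rows: Vec<String>) -> Result<FeatureConfig, String> {
    // With the duplicate schema visible, `rows` is roughly twice its normal
    // size, so this limit check suddenly fails on every machine.
    if rows.len() > FEATURE_LIMIT {
        return Err(format!(
            "feature file has {} entries, limit is {}",
            rows.len(),
            FEATURE_LIMIT
        ));
    }
    Ok(FeatureConfig { names: rows })
}

fn on_config_refresh(rows: Vec<String>) {
    // The caller treated "too many features" as impossible: unwrap() turns
    // the Err into a panic that takes down the worker thread serving traffic.
    let config = load_features(rows).unwrap();
    println!("loaded {} features", config.names.len());
}

fn main() {
    on_config_refresh(vec!["feature".to_string(); 150]); // normal file loads fine
    on_config_refresh(vec!["feature".to_string(); 300]); // doubled file panics here
}
```

The point the thread keeps returning to is the last step: the limit check itself is reasonable, but `unwrap()` converts a recoverable bad-input error into a process-killing panic.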
Testing, staging, and rollout
- Many commenters argue the failure should have been caught in staging or CI by:
  - Realistic data-volume tests or synthetic “20x data” tests.
  - Golden-result tests for key DB queries before and after permission changes.
  - Validating the generated feature file (size, duplicates, schema) and test-loading it into a proxy before global rollout (a minimal validation sketch follows this list).
- Others note that duplicating Cloudflare’s production scale for staging is extremely expensive, but the counter-arguments are that:
  - You don’t need full scale for every commit; periodic large-scale tests and strong canarying would help.
  - Config changes that can take down the fleet should have progressive, ring-based rollouts and auto-rollback, not “push everywhere every 5 minutes”.
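One concrete version of the validation idea, as a hedged sketch: the limit, the duplicate check, the function name, and the sample feature names are assumptions, not a description of Cloudflare’s pipeline.

```rust
use std::collections::HashSet;

const FEATURE_LIMIT: usize = 200; // must match the proxy's runtime limit

/// Hypothetical CI / pre-publish gate for a generated feature file:
/// reject the artifact before any proxy ever loads it.
fn validate_feature_file(names: &[String]) -> Result<(), String> {
    if names.len() > FEATURE_LIMIT {
        return Err(format!(
            "generated file has {} features, runtime limit is {}",
            names.len(),
            FEATURE_LIMIT
        ));
    }
    let mut seen = HashSet::new();
    for name in names {
        // Duplicate columns leaking in from a second schema would fail here.
        if !seen.insert(name.as_str()) {
            return Err(format!("duplicate feature: {name}"));
        }
    }
    Ok(())
}

fn main() {
    let doubled: Vec<String> = ["bot_score_ml", "bot_score_heuristic", "bot_score_ml"]
        .iter()
        .map(|s| s.to_string())
        .collect();
    println!("{:?}", validate_feature_file(&doubled)); // Err(duplicate feature: ...)
}
```

Run against every generated artifact, and again on a small canary ring that actually loads the file, this is the kind of cheap check commenters contrast with the cost of a full-scale staging environment.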
Rust, `unwrap()`, and error handling
- Large subthread around whether using `unwrap()` in critical Rust code is acceptable.
  - Critics: in production, `unwrap()` is equivalent to an unguarded panic, hides invariants that should be expressed as `Result` handling, and should be linted or banned.
  - Defenders: the real problem is the violated invariant and the lack of higher-level handling; replacing `unwrap()` with `return Err(...)` would still have yielded 5xxs without better design (both paths are sketched after this list).
- Broader debate compares Rust’s `Result`-style errors vs. exceptions, checked vs. unchecked, and how easy it is in all languages to paper over error paths.
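To make the two positions concrete, a small sketch with entirely hypothetical names: propagating the error instead of `unwrap()`-ing it only helps if some caller decides what “degraded” looks like, for example keeping the last-good config.

```rust
struct FeatureConfig {
    names: Vec<String>,
}

fn load_features(rows: Vec<String>) -> Result<FeatureConfig, String> {
    if rows.len() > 200 {
        return Err(format!("{} features exceeds the 200-feature limit", rows.len()));
    }
    Ok(FeatureConfig { names: rows })
}

struct BotModule {
    current: FeatureConfig, // last-known-good configuration
}

impl BotModule {
    // The pattern critics object to: a bad config file panics the worker.
    #[allow(dead_code)]
    fn refresh_unwrap(&mut self, rows: Vec<String>) {
        self.current = load_features(rows).unwrap();
    }

    // Returning the Err merely moves the failure up a level; the design
    // question defenders raise is what the caller does with it. Here:
    // log, alert, and keep serving with the last-good configuration.
    fn refresh_keep_last_good(&mut self, rows: Vec<String>) {
        match load_features(rows) {
            Ok(config) => self.current = config,
            Err(e) => eprintln!("rejected feature file, keeping previous config: {e}"),
        }
    }
}

fn main() {
    let mut module = BotModule {
        current: load_features(vec!["feature".to_string(); 150]).expect("baseline config"),
    };
    // A doubled file is rejected, but the module keeps its 150 features.
    module.refresh_keep_last_good(vec!["feature".to_string(); 300]);
    println!("still serving with {} features", module.current.names.len());
}
```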
Architecture, blast radius, and failure modes
- Many point out this was not “just a bug” but an architectural issue:
  - A non-core feature (bot scoring) was able to crash the core proxy.
  - The system “fail-crashed” instead of failing open or keeping the last-good config.
- Suggestions:
  - Treat rapid, global config as dangerous code: canaries, fault isolation (“cells”/regions), carefully handled global kill switches, and strong observability on panics and config ingestion.
  - Ensure panics in modules are survivable by supervisors or by falling back to previous configs, with clear alerts (a supervisor sketch follows this list).
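A sketch of the “survivable panic” idea, assuming a non-core bot-scoring module and a neutral fallback score; it uses `std::panic::catch_unwind` as the supervisor, which is one of several ways to isolate the fault.

```rust
use std::panic;

// Stand-in for a non-core module that panics on a poisoned config push.
fn bot_score(request_id: u64) -> u8 {
    if request_id % 1_000 == 0 {
        panic!("feature file exceeded limit");
    }
    42
}

// Core request path: a panic in the bot module degrades that one feature
// (fail-open with an "unscored" fallback) instead of becoming a 5xx.
// (The default panic hook still prints its message to stderr.)
fn handle_request(request_id: u64) -> u8 {
    panic::catch_unwind(move || bot_score(request_id)).unwrap_or_else(|_| {
        eprintln!("bot module panicked; serving request {request_id} unscored");
        0 // hypothetical neutral score; the real fallback policy is a product decision
    })
}

fn main() {
    for id in 999..=1_001 {
        println!("request {id} -> score {}", handle_request(id));
    }
}
```

The trade-off commenters flag: failing open on bot scoring keeps traffic flowing, but it also means attacks go unscored while the fallback is active, so the fallback needs to be loud and visible, not silent.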
Operational response and transparency
- Some are impressed by how fast and detailed the public postmortem appeared, including code snippets and a candid incident timeline.
- Others focus on the ~3 hours it took to identify the feature file as the root cause, questioning:
  - Why massive new panics in FL2 weren’t an immediate, high-signal alert (see the panic-hook sketch after this list).
  - Why “it’s a DDoS” was the dominant hypothesis for so long.
- The separate outage of the third-party status page further biased engineers toward believing it was an attack.
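On the observability point, one cheap way to make panics an immediate, high-signal alert is a process-wide panic hook feeding a counter that monitoring scrapes. A sketch under that assumption (the metric export and paging wiring are omitted and hypothetical):

```rust
use std::panic;
use std::sync::atomic::{AtomicU64, Ordering};

static PANIC_COUNT: AtomicU64 = AtomicU64::new(0);

// Hypothetical wiring: count every panic so that a sudden fleet-wide jump
// ("new panics in FL2") can page someone within minutes, not hours.
fn install_panic_metrics() {
    let default_hook = panic::take_hook();
    panic::set_hook(Box::new(move |info| {
        PANIC_COUNT.fetch_add(1, Ordering::Relaxed);
        default_hook(info); // keep the usual message and backtrace behaviour
    }));
}

fn main() {
    install_panic_metrics();
    // Simulate a module panicking on a bad config push.
    let _ = panic::catch_unwind(|| {
        panic!("feature file exceeded limit");
    });
    println!("panics observed: {}", PANIC_COUNT.load(Ordering::Relaxed));
}
```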
Centralization and systemic risk
- Extensive reflection on how much of the internet now depends on a few providers (Cloudflare, AWS, etc.), drawing analogies to historic telco and infrastructure outages.
- Some users report practical impact (unable to manage DNS, log into services) and reconsider reliance on a single CDN/DNS provider.
- A minority argues for regulation and liability around critical internet infrastructure; others counter that outages are inevitable in complex systems and that learning from failures is the path to resilience.