Cloudflare was down
Scope and Nature of the Outage
- A large portion of the internet briefly returned plain 500 error pages branded “cloudflare”: npm, Supabase, Notion, Shopify, Claude, Perplexity, LinkedIn, major crypto exchanges, media and anime sites, documentation sites, etc.
- Some Cloudflare users were unaffected: many small sites, some Workers / Tunnels / R2 / KV use-cases, and non-proxied setups stayed up; in some cases WebSocket connections kept working while the main sites failed.
- Cloudflare’s own website, dashboard and APIs were down; some third-party services that depend on Cloudflare (e.g., Porkbun DNS UI, Docker Hub, various SaaS) also failed.
Status Pages, SLAs, and Transparency
- Cloudflare’s status page initially showed only “scheduled maintenance” (Chicago) and later a narrowly scoped “dashboard/API issues” incident, conflicting with the widespread 500s customers were seeing.
- Many argue big providers’ status pages are “for show,” incentivized by SLAs to under-report outages as “degraded performance” and to delay flipping to “down.”
- Others stress that keeping customers informed is a core part of incident response, and status pages should be independent, automated where possible, and even hosted off-provider.
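As a concrete reading of “independent, automated, and hosted off-provider,” here is a minimal sketch of an external availability probe that could drive a status page running on separate infrastructure. The probe URL, window size, and failure threshold are illustrative assumptions, not anything Cloudflare or its customers publish.

```python
# Minimal sketch of an independent, automated availability probe that could
# feed an off-provider status page. All endpoints and thresholds are hypothetical.
import time
import urllib.error
import urllib.request

PROBE_URL = "https://www.example.com/healthz"  # hypothetical site behind the CDN
WINDOW = 10            # number of recent probes to consider
FAIL_THRESHOLD = 0.5   # flip to "down" once half of recent probes fail


def probe_once(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a non-5xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as exc:
        return exc.code < 500
    except (urllib.error.URLError, TimeoutError):
        return False


def run_monitor(interval: float = 30.0) -> None:
    history: list[bool] = []
    while True:
        history.append(probe_once(PROBE_URL))
        history[:] = history[-WINDOW:]
        failure_rate = 1 - sum(history) / len(history)
        status = "down" if failure_rate >= FAIL_THRESHOLD else "up"
        # A real setup would push this state to a status page hosted on
        # infrastructure independent of the provider being monitored.
        print(f"{time.strftime('%H:%M:%S')} failure_rate={failure_rate:.0%} status={status}")
        time.sleep(interval)


if __name__ == "__main__":
    run_monitor()
```

The design point is simply that the probe, the decision logic, and the page it updates all live outside the provider whose health is being reported.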
Centralization and Single Points of Failure
- Multiple comments note how deeply Cloudflare has become a single point of failure: when it breaks, “half the internet” appears down, including monitoring sites like DownDetector.
- Some say Cloudflare’s free tier and integrated CDN/WAF/DDoS offering explain this dominance; alternatives (Fastly, bunny.net, CloudFront, etc.) often cost more or are more complex.
- Debate over whether it’s reasonable for non-critical businesses to accept rare global outages, versus critical sectors (banks, hospitals, air traffic control) that must design around any third‑party SPOF.
Cloudflare’s Explanation and Engineering Practices
- Later incident note: a change to the Web Application Firewall’s request parsing, rolled out to mitigate a new React Server Components vulnerability, made Cloudflare’s network unavailable for several minutes; explicitly “not an attack.”
- Many note a recurring pattern: global outages triggered by config/WAF changes with no apparent staged rollout or canaries, which critics say contradicts industry best practices for critical infrastructure (see the canary-rollout sketch after this list).
- Discussion of Rust versus the previous stack concludes that the problems lie in operational discipline, configuration, and rollout strategy, not in the language itself.
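To make the staged-rollout criticism concrete, here is a minimal sketch of a canary-gated rollout loop. The stage names, bake time, error budget, and the commented apply/rollback hooks are illustrative assumptions, not a description of Cloudflare’s actual deployment pipeline.

```python
# Minimal sketch of a staged (canary) rollout gate for a config/WAF change.
# Stages, telemetry source, and thresholds are hypothetical.
import random
import time

STAGES = ["canary-pop", "single-region", "25-percent", "global"]
ERROR_BUDGET = 0.01   # abort if the 5xx rate rises more than one percentage point
BAKE_SECONDS = 300    # observe each stage for five minutes before expanding


def observed_5xx_rate(scope: str) -> float:
    """Placeholder for real edge telemetry (here: random noise for illustration)."""
    return random.uniform(0.0, 0.02)


def rollout(change_id: str) -> bool:
    baseline = observed_5xx_rate("baseline")
    for stage in STAGES:
        print(f"applying {change_id} to {stage}")
        # apply_change(change_id, stage)   # hypothetical deploy call
        time.sleep(BAKE_SECONDS)
        delta = observed_5xx_rate(stage) - baseline
        if delta > ERROR_BUDGET:
            print(f"5xx rate up {delta:.2%} at {stage}; rolling back")
            # rollback(change_id, stage)   # hypothetical rollback call
            return False
    print(f"{change_id} fully deployed")
    return True
```

The shape is what matters: expand only after each stage bakes with healthy telemetry, and make rollback the default outcome when the error budget is exceeded.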
Reliability Trends, Architecture, and Industry Culture
- Concern that this is the second or third major Cloudflare incident within a few weeks, eroding trust and making the company look like “the weak link of the internet.”
- Some argue internet-scale systems “will randomly fail” and perfect reliability is economically impossible; others counter that repeated global incidents show architectural and process shortcomings.
- Several urge teams to re-evaluate their Cloudflare dependency, multi-CDN/DNS strategies, and contingency plans, while acknowledging leadership often rejects costly redundancy that only pays off during rare events.
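For teams weighing such contingency plans, here is a minimal sketch of a DNS-level multi-CDN failover check. The hostnames and the update_dns_record helper are hypothetical stand-ins for a DNS provider’s real API, and the approach presumes low TTLs configured in advance.

```python
# Minimal sketch of a DNS-level multi-CDN failover decision. Hostnames and the
# DNS-update helper are hypothetical placeholders.
import urllib.error
import urllib.request

PRIMARY_CDN = "primary-cdn.example.net"    # hypothetical CNAME target on the main CDN
SECONDARY_CDN = "backup-cdn.example.net"   # hypothetical CNAME target on a second CDN
SITE_HOST = "www.example.com"


def edge_healthy(probe_url: str, timeout: float = 5.0) -> bool:
    """True if the CDN edge answers the probe with a non-5xx status."""
    try:
        with urllib.request.urlopen(probe_url, timeout=timeout) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as exc:
        return exc.code < 500
    except (urllib.error.URLError, TimeoutError):
        return False


def update_dns_record(host: str, cname_target: str) -> None:
    """Hypothetical stand-in for the DNS provider's record-update API call."""
    print(f"would point {host} CNAME at {cname_target}")


def failover_if_needed() -> None:
    if edge_healthy(f"https://{PRIMARY_CDN}/healthz"):
        update_dns_record(SITE_HOST, PRIMARY_CDN)
    elif edge_healthy(f"https://{SECONDARY_CDN}/healthz"):
        update_dns_record(SITE_HOST, SECONDARY_CDN)
    else:
        print("both CDNs failing probes; leaving DNS unchanged")


if __name__ == "__main__":
    failover_if_needed()
```

This is the cheap version of multi-CDN: it does nothing for traffic already in flight, which is exactly the kind of partial mitigation leadership tends to weigh against the cost of full redundancy.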
Community Tone
- Mix of frustration (“Clownflare,” complaints about bot challenges and 5‑nines marketing) and empathy for on-call engineers under intense pressure.
- Extensive humor around cascading “DownDetector’s DownDetector” sites, Friday deploys, and “vibe coding,” alongside serious reflection that centralization and rushed changes are raising systemic risk.