Cloudflare was down

Scope and Nature of the Outage

  • A large portion of the internet briefly returned plain 500 errors branded “cloudflare”: npm, Supabase, Notion, Shopify, Claude, Perplexity, LinkedIn, major crypto exchanges, media and anime sites, documentation sites, etc.
  • Some Cloudflare users were unaffected: many small sites, some Workers / Tunnels / R2 / KV use cases, and non-proxied setups stayed up; in some cases WebSockets kept working while the main sites failed.
  • Cloudflare’s own website, dashboard and APIs were down; some third-party services that depend on Cloudflare (e.g., Porkbun DNS UI, Docker Hub, various SaaS) also failed.

Status Pages, SLAs, and Transparency

  • Cloudflare’s status page initially showed only “scheduled maintenance” (Chicago) and later a narrow “dashboard/API issues” incident, conflicting with widespread customer 500s.
  • Many argue big providers’ status pages are “for show,” incentivized by SLAs to under-report outages as “degraded performance” and to delay flipping to “down.”
  • Others stress that keeping customers informed is a core part of incident response, and status pages should be independent, automated where possible, and even hosted off-provider.
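
  One concrete version of the “independent, automated, hosted off-provider” idea mentioned above is a small external probe that checks a few public endpoints from infrastructure outside the provider’s network and feeds an externally hosted status page. A minimal sketch in Python; the endpoint URLs and the 5xx threshold are illustrative assumptions, not anyone’s actual monitoring setup:

    # Minimal sketch of an independent, automated status probe, run from
    # infrastructure outside the provider being monitored. The endpoint
    # list and thresholds are illustrative assumptions, not anyone's
    # actual monitoring config.
    import time
    import urllib.request
    import urllib.error

    ENDPOINTS = [
        "https://www.example.com/healthz",      # hypothetical origin health check
        "https://dash.example.com/api/status",  # hypothetical dashboard/API check
    ]

    def probe(url: str, timeout: float = 5.0) -> tuple[bool, str]:
        """Return (is_up, detail) for a single endpoint."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status < 500, f"HTTP {resp.status}"
        except urllib.error.HTTPError as exc:
            return exc.code < 500, f"HTTP {exc.code}"
        except Exception as exc:  # DNS failure, timeout, TLS error, etc.
            return False, type(exc).__name__

    def run_once() -> None:
        for url in ENDPOINTS:
            up, detail = probe(url)
            state = "UP" if up else "DOWN"
            stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
            print(f"{stamp} {state:4} {url} ({detail})")

    if __name__ == "__main__":
        while True:
            run_once()
            time.sleep(60)  # poll every minute; push results to an off-provider status page

  The point of the sketch is the placement, not the code: because the probe runs and publishes from somewhere else, it keeps reporting even when the provider’s own dashboard and APIs are down.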

Centralization and Single Points of Failure

  • Multiple comments note how deeply Cloudflare has become a single point of failure: when it breaks, “half the internet” appears down, including monitoring sites like DownDetector.
  • Some say Cloudflare’s free tier and integrated CDN/WAF/DDoS offering explain this dominance; alternatives (Fastly, bunny.net, CloudFront, etc.) often cost more or are more complex.
  • Debate over whether it’s reasonable for non-critical businesses to accept rare global outages, versus critical sectors (banks, hospitals, air traffic control) that must design around any third-party SPOF.

Cloudflare’s Explanation and Engineering Practices

  • Cloudflare’s later incident note says a change to the Web Application Firewall’s request parsing, rolled out to mitigate a new React Server Components vulnerability, made Cloudflare’s network unavailable for several minutes; it was explicitly “not an attack.”
  • Many note a recurring pattern: global outages triggered by config/WAF changes without apparent staged rollout or canaries (see the sketch after this list); critics say this contradicts industry best practices for critical infrastructure.
  • Discussion of Rust versus the previous stack concludes that the problems lie in operational discipline, configuration, and rollout strategy, not in the language itself.
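
  For readers unfamiliar with the staged-rollout pattern commenters keep invoking, here is a hedged Python sketch of the general shape: push a change to a small fraction of the fleet, watch an error budget, expand or roll back. The stage sizes, threshold, and helpers (apply_config, error_rate, rollback) are hypothetical placeholders; this is not Cloudflare’s actual deployment system, just the pattern the thread says it expected.

    # Illustrative sketch of a staged (canary) rollout for a config change
    # such as a new WAF rule. Stage sizes, error-rate threshold, and the
    # helper functions are hypothetical, not any provider's real tooling.
    import time

    STAGES = [0.01, 0.05, 0.25, 1.00]   # fraction of fleet per stage
    ERROR_BUDGET = 0.001                # abort if 5xx rate exceeds 0.1%
    SOAK_SECONDS = 300                  # watch each stage before expanding

    def apply_config(change_id: str, fraction: float) -> None:
        """Hypothetical: push the change to `fraction` of edge nodes."""
        print(f"applying {change_id} to {fraction:.0%} of the fleet")

    def error_rate(fraction: float) -> float:
        """Hypothetical: observed 5xx rate on nodes running the change."""
        return 0.0

    def rollback(change_id: str) -> None:
        """Hypothetical: revert the change everywhere."""
        print(f"rolling back {change_id}")

    def staged_rollout(change_id: str) -> bool:
        for fraction in STAGES:
            apply_config(change_id, fraction)
            time.sleep(SOAK_SECONDS)                 # let metrics accumulate
            if error_rate(fraction) > ERROR_BUDGET:  # regression detected
                rollback(change_id)
                return False
        return True

    if __name__ == "__main__":
        ok = staged_rollout("waf-rule-rsc-mitigation")
        print("rollout", "completed" if ok else "aborted")

  The criticism in the thread is essentially that a parsing change of this kind apparently reached the whole network before the equivalent of the error_rate check could stop it.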

Reliability Trends, Architecture, and Industry Culture

  • Concern that this is the second or third major Cloudflare incident in a matter of weeks, eroding trust and making the company look like “the weak link of the internet.”
  • Some argue internet-scale systems “will randomly fail” and perfect reliability is economically impossible; others counter that repeated global incidents show architectural and process shortcomings.
  • Several urge teams to re-evaluate their Cloudflare dependency, multi-CDN/DNS strategies, and contingency plans (see the sketch after this list), while acknowledging that leadership often rejects costly redundancy that only pays off during rare events.
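
  As a concrete (and heavily simplified) example of the contingency planning mentioned above, one common shape is a health check on the primary CDN that repoints DNS at a secondary provider when the primary fails. The hostnames and the update_dns_record helper below are hypothetical placeholders; a real setup would call a specific DNS provider’s API and account for TTLs and propagation delay.

    # Hedged sketch of a simple multi-CDN contingency: health-check the
    # primary CDN and, if it is failing, repoint DNS at a secondary
    # provider. Hostnames and update_dns_record are hypothetical.
    import urllib.request
    import urllib.error

    PRIMARY_CDN = "primary-cdn.example.net"      # e.g. the Cloudflare-proxied hostname
    SECONDARY_CDN = "secondary-cdn.example.net"  # e.g. a second CDN or the raw origin
    HEALTH_PATH = "/healthz"

    def healthy(host: str, timeout: float = 5.0) -> bool:
        """True if the host answers the health path with a non-5xx response."""
        try:
            with urllib.request.urlopen(f"https://{host}{HEALTH_PATH}", timeout=timeout) as resp:
                return resp.status < 500
        except urllib.error.HTTPError as exc:
            return exc.code < 500
        except Exception:
            return False

    def update_dns_record(name: str, target: str) -> None:
        """Hypothetical: point `name` (a CNAME) at `target` via the DNS provider's API."""
        print(f"pointing {name} at {target}")

    def failover_check(site: str = "www.example.com") -> None:
        if healthy(PRIMARY_CDN):
            update_dns_record(site, PRIMARY_CDN)
        elif healthy(SECONDARY_CDN):
            update_dns_record(site, SECONDARY_CDN)  # primary down, fail over
        else:
            print("both providers failing; page a human")

    if __name__ == "__main__":
        failover_check()

  The thread’s counterpoint stands: running and testing this kind of redundancy costs real money and engineering time, which is exactly the spend leadership tends to cut when outages are rare.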

Community Tone

  • Mix of frustration (“Clownflare,” complaints about bot challenges and 5‑nines marketing) and empathy for on-call engineers under intense pressure.
  • Extensive humor around cascading “DownDetector’s DownDetector” sites, Friday deploys, and “vibe coding,” alongside serious reflection that centralization and rushed changes are raising systemic risk.