Cloudflare was down
Scope and Nature of the Outage
- A large portion of the internet briefly returned plain 500 error pages branded “cloudflare”: npm, Supabase, Notion, Shopify, Claude, Perplexity, LinkedIn, major crypto exchanges, media and anime sites, documentation sites, etc.
- Some Cloudflare users were unaffected: many small sites, some Workers / Tunnels / R2 / KV use-cases, and non-proxied setups stayed up; in some cases WebSocket connections kept working while the main sites failed.
- Cloudflare’s own website, dashboard and APIs were down; some third-party services that depend on Cloudflare (e.g., Porkbun DNS UI, Docker Hub, various SaaS) also failed.
Status Pages, SLAs, and Transparency
- Cloudflare’s status page initially showed only “scheduled maintenance” (Chicago) and later a narrowly scoped “dashboard/API issues” incident, conflicting with the widespread 500s customers were seeing.
- Many argue big providers’ status pages are “for show,” incentivized by SLAs to under-report outages as “degraded performance” and to delay flipping to “down.”
- Others stress that keeping customers informed is a core part of incident response, and status pages should be independent, automated where possible, and even hosted off-provider.
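As a concrete reading of “independent, automated, and hosted off-provider,” here is a minimal sketch of an external availability probe that could drive a status page running on separate infrastructure. The probe URL, window size, and failure threshold are illustrative assumptions, not anything Cloudflare or its customers publish.

```python
# Minimal sketch of an independent, automated availability probe that could
# feed an off-provider status page. All endpoints and thresholds are hypothetical.
import time
import urllib.error
import urllib.request

PROBE_URL = "https://www.example.com/healthz"  # hypothetical site behind the CDN
WINDOW = 10            # number of recent probes to consider
FAIL_THRESHOLD = 0.5   # flip to "down" once half of recent probes fail


def probe_once(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a non-5xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as exc:
        return exc.code < 500
    except (urllib.error.URLError, TimeoutError):
        return False


def run_monitor(interval: float = 30.0) -> None:
    history: list[bool] = []
    while True:
        history.append(probe_once(PROBE_URL))
        history[:] = history[-WINDOW:]
        failure_rate = 1 - sum(history) / len(history)
        status = "down" if failure_rate >= FAIL_THRESHOLD else "up"
        # A real setup would push this state to a status page hosted on
        # infrastructure independent of the provider being monitored.
        print(f"{time.strftime('%H:%M:%S')} failure_rate={failure_rate:.0%} status={status}")
        time.sleep(interval)


if __name__ == "__main__":
    run_monitor()
```

The design point is simply that the probe, the decision logic, and the page it updates all live outside the provider whose health is being reported.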
Centralization and Single Points of Failure
- Multiple comments note how deeply Cloudflare has become a single point of failure: when it breaks, “half the internet” appears down, including monitoring sites like DownDetector.
- Some say Cloudflare’s free tier and integrated CDN/WAF/DDoS offering explain this dominance; alternatives (Fastly, bunny.net, CloudFront, etc.) often cost more or are more complex.
- Debate over whether it’s reasonable for non-critical businesses to accept rare global outages, versus critical sectors (banks, hospitals, air traffic control) that must design around any third‑party SPOF.
Cloudflare’s Explanation and Engineering Practices
- Later incident note: a change to the Web Application Firewall’s request parsing, rolled out to mitigate a new React Server Components vulnerability, made Cloudflare’s network unavailable for several minutes; explicitly “not an attack.”
- Many note a recurring pattern: global outages triggered by config/WAF changes with no apparent staged rollout or canaries, which critics say contradicts industry best practices for critical infrastructure (see the canary-rollout sketch after this list).
- Discussion of Rust versus the previous stack concludes that the problems lie in operational discipline, configuration, and rollout strategy, not in the language itself.
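To make the staged-rollout criticism concrete, here is a minimal sketch of a canary-gated rollout loop. The stage names, bake time, error budget, and the commented apply/rollback hooks are illustrative assumptions, not a description of Cloudflare’s actual deployment pipeline.

```python
# Minimal sketch of a staged (canary) rollout gate for a config/WAF change.
# Stages, telemetry source, and thresholds are hypothetical.
import random
import time

STAGES = ["canary-pop", "single-region", "25-percent", "global"]
ERROR_BUDGET = 0.01   # abort if the 5xx rate rises more than one percentage point
BAKE_SECONDS = 300    # observe each stage for five minutes before expanding


def observed_5xx_rate(scope: str) -> float:
    """Placeholder for real edge telemetry (here: random noise for illustration)."""
    return random.uniform(0.0, 0.02)


def rollout(change_id: str) -> bool:
    baseline = observed_5xx_rate("baseline")
    for stage in STAGES:
        print(f"applying {change_id} to {stage}")
        # apply_change(change_id, stage)   # hypothetical deploy call
        time.sleep(BAKE_SECONDS)
        delta = observed_5xx_rate(stage) - baseline
        if delta > ERROR_BUDGET:
            print(f"5xx rate up {delta:.2%} at {stage}; rolling back")
            # rollback(change_id, stage)   # hypothetical rollback call
            return False
    print(f"{change_id} fully deployed")
    return True
```

The shape is what matters: expand only after each stage bakes with healthy telemetry, and make rollback the default outcome when the error budget is exceeded.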
Reliability Trends, Architecture, and Industry Culture
- Concern that this is the second or third major Cloudflare incident within a few weeks, eroding trust and making the company look like “the weak link of the internet.”
- Some argue internet-scale systems “will randomly fail” and perfect reliability is economically impossible; others counter that repeated global incidents show architectural and process shortcomings.
- Several urge teams to re-evaluate their Cloudflare dependency, multi-CDN/DNS strategies, and contingency plans, while acknowledging leadership often rejects costly redundancy that only pays off during rare events.
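For teams weighing such contingency plans, here is a minimal sketch of a DNS-level multi-CDN failover check. The hostnames and the update_dns_record helper are hypothetical stand-ins for a DNS provider’s real API, and the approach presumes low TTLs configured in advance.

```python
# Minimal sketch of a DNS-level multi-CDN failover decision. Hostnames and the
# DNS-update helper are hypothetical placeholders.
import urllib.error
import urllib.request

PRIMARY_CDN = "primary-cdn.example.net"    # hypothetical CNAME target on the main CDN
SECONDARY_CDN = "backup-cdn.example.net"   # hypothetical CNAME target on a second CDN
SITE_HOST = "www.example.com"


def edge_healthy(probe_url: str, timeout: float = 5.0) -> bool:
    """True if the CDN edge answers the probe with a non-5xx status."""
    try:
        with urllib.request.urlopen(probe_url, timeout=timeout) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as exc:
        return exc.code < 500
    except (urllib.error.URLError, TimeoutError):
        return False


def update_dns_record(host: str, cname_target: str) -> None:
    """Hypothetical stand-in for the DNS provider's record-update API call."""
    print(f"would point {host} CNAME at {cname_target}")


def failover_if_needed() -> None:
    if edge_healthy(f"https://{PRIMARY_CDN}/healthz"):
        update_dns_record(SITE_HOST, PRIMARY_CDN)
    elif edge_healthy(f"https://{SECONDARY_CDN}/healthz"):
        update_dns_record(SITE_HOST, SECONDARY_CDN)
    else:
        print("both CDNs failing probes; leaving DNS unchanged")


if __name__ == "__main__":
    failover_if_needed()
```

This is the cheap version of multi-CDN: it does nothing for traffic already in flight, which is exactly the kind of partial mitigation leadership tends to weigh against the cost of full redundancy.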
Community Tone
- Mix of frustration (“Clownflare,” complaints about bot challenges and 5‑nines marketing) and empathy for on-call engineers under intense pressure.
- Extensive humor around cascading “DownDetector’s DownDetector” sites, Friday deploys, and “vibe coding,” alongside serious reflection that centralization and rushed changes are raising systemic risk.