Cloudflare 1.1.1.1 Incident on July 14, 2025
Impact and user experience
- Some users never noticed the outage because they used DoH (often via cloudflare-dns.com), multi-provider setups, or local resolvers.
- Others discovered DNS was broken before Cloudflare’s status page acknowledged it and permanently switched to Google or other providers.
- A few felt burned: they had just moved to 1.1.1.1 after ISP DNS issues and now see public resolvers as less reliable overall.
- Several point out that traffic not fully returning to prior levels likely reflects client caching and users who never switched back.
Redundancy, “backup DNS”, and client behavior
- Many assumed 1.0.0.1 is an independent backup for 1.1.1.1; discussion clarifies both are the same anycast service and were taken down together.
- Multiple commenters stress that “secondary DNS” is often not true failover: clients may round-robin, have buggy behavior, or mark servers “down” for a while after timeouts.
- Recommendation from many: mix different providers (e.g., 1.1.1.1 + 8.8.8.8 or Quad9), ideally fronted by a local caching/forwarding resolver that can race or health‑check upstreams.
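As a rough illustration of the “race several upstreams” idea, here is a minimal Python sketch using dnspython; the provider IPs, timeouts, and structure are illustrative assumptions rather than anything prescribed in the thread, and a real local forwarder (Unbound, dnsdist, etc.) adds caching and health checks on top:

```python
# Minimal sketch: send the same query to resolvers run by different
# operators and take the first answer that comes back. Purely illustrative;
# a production forwarder also caches, retries, and health-checks upstreams.
from concurrent.futures import ThreadPoolExecutor, as_completed

import dns.message
import dns.query

# Deliberately mixed providers so one operator's outage is survivable.
UPSTREAMS = ["1.1.1.1", "8.8.8.8", "9.9.9.9"]


def ask(server: str, qname: str) -> dns.message.Message:
    query = dns.message.make_query(qname, "A")
    return dns.query.udp(query, server, timeout=2.0)


def resolve_racing(qname: str) -> dns.message.Message:
    with ThreadPoolExecutor(max_workers=len(UPSTREAMS)) as pool:
        futures = {pool.submit(ask, server, qname): server for server in UPSTREAMS}
        for future in as_completed(futures):
            try:
                return future.result()  # first successful answer wins
            except Exception:
                continue  # that upstream timed out or errored; keep waiting
    raise RuntimeError("all upstreams failed")


if __name__ == "__main__":
    for rrset in resolve_racing("example.com").answer:
        print(rrset)
```

The point is simply that the upstreams belong to different operators, so a repeat of this incident would cost one answer path rather than all of them.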
Cloudflare vs other resolvers (privacy, performance, policy)
- Debate over whom to trust: Cloudflare vs Google vs ISPs vs Quad9/OpenDNS/dns0/etc.
- Arguments for big public resolvers: usually faster, often less censorship than ISPs, well-documented privacy policies.
- Arguments against: US jurisdiction, prior privacy controversies, possible logging/telemetry; some prefer local ISPs regulated under national law or European‑run services.
- Quad9’s blocking and telemetry policies draw criticism from site operators hit by over‑broad blocking; others see that as acceptable for filtering.
Running your own resolver and local setups
- Strong theme: run your own recursive resolver (Unbound, dnsmasq, dnsdist, Technitium, Pi‑hole + Unbound) to reduce dependence on any single provider and improve privacy.
- Some report poor latencies when resolving directly from remote authoritative servers (especially from New Zealand), while others say the overhead is negligible compared to web-page bloat.
- Various recipes shared: racing upstreams, DoT‑only forwarders, mixing filtered/unfiltered resolvers, and careful interleaving for systemd‑resolved.
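One of those recipes, a DoT-only lookup with upstreams from more than one operator, can be prototyped in a few lines. This sketch uses dnspython’s TLS query support; the Cloudflare and Quad9 endpoints are chosen purely as examples:

```python
# Sketch of a DoT-only lookup with provider fallback: try each upstream
# over TLS on port 853, authenticating the server by its TLS hostname.
import dns.message
import dns.query

# (ip, tls_auth_name) pairs for two different operators; illustrative only.
DOT_UPSTREAMS = [
    ("1.1.1.1", "cloudflare-dns.com"),
    ("9.9.9.9", "dns.quad9.net"),
]


def resolve_dot(qname: str, rdtype: str = "A") -> dns.message.Message:
    query = dns.message.make_query(qname, rdtype)
    last_error = None
    for ip, auth_name in DOT_UPSTREAMS:
        try:
            # server_hostname drives both SNI and certificate validation.
            return dns.query.tls(query, ip, timeout=3.0, server_hostname=auth_name)
        except Exception as exc:
            last_error = exc  # fall through to the next provider
    raise RuntimeError(f"all DoT upstreams failed: {last_error}")


if __name__ == "__main__":
    for rrset in resolve_dot("example.org").answer:
        print(rrset)
```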
DoH/DoT behavior and limitations
- DoH is usually configured by hostname, which must itself be resolved via some other DNS, creating a bootstrap dependency (see the sketch after this list).
- Many platforms (Android, some routers, Windows DoH auto-config) only support a single DoH provider or have awkward fallback semantics, undermining real redundancy.
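A small sketch of that bootstrap dependency, using the standard library plus dnspython for the wire format (the hostname and example query are arbitrary): the DoH server’s own name has to be resolved by classic DNS before any encrypted query can be sent.

```python
# Sketch of the DoH bootstrap problem: a DoH endpoint named by hostname
# must itself be resolved with ordinary DNS before any DoH query works.
import base64
import socket
import urllib.request

import dns.message

DOH_HOST = "cloudflare-dns.com"

# Step 1: the bootstrap resolution, done over plain old DNS.
bootstrap = socket.getaddrinfo(DOH_HOST, 443, type=socket.SOCK_STREAM)
print("bootstrap resolution of", DOH_HOST, "->",
      sorted({ai[4][0] for ai in bootstrap}))

# Step 2: only now can we send an RFC 8484 GET query to the DoH server.
wire = dns.message.make_query("example.com", "A").to_wire()
param = base64.urlsafe_b64encode(wire).rstrip(b"=").decode()
req = urllib.request.Request(
    f"https://{DOH_HOST}/dns-query?dns={param}",
    headers={"Accept": "application/dns-message"},
)
with urllib.request.urlopen(req, timeout=5) as resp:
    answer = dns.message.from_wire(resp.read())
for rrset in answer.answer:
    print(rrset)
```

Clients typically paper over this by shipping bootstrap IPs for well-known providers or by falling back to plain DNS, which is part of why genuine multi-provider DoH redundancy is awkward in practice.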
Cloudflare’s RCA, monitoring, and engineering practices
- Root cause as understood from the post: a misconfiguration in a legacy deployment/topology system that incorrectly associated 1.1.1.1/1.0.0.1 with a non‑production service, then propagated globally.
- Some praise the transparency and technical detail; others dislike the “legacy/strategic system” corporatese and want crisper plain language.
- Significant discussion around the ~5–8 minute detection delay: some think that’s unacceptably slow for a global resolver; operators counter that tighter thresholds cause intolerable false positives at this scale.
- Several call for stronger safeguards (e.g., hard‑blocking reassignment of key prefixes, better staged rollouts, clearer separation of failure domains for the two IPs, more central change management).
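To make the “hard-block reassignment of key prefixes” suggestion concrete, here is a hypothetical pre-deployment guard; the prefix list, change format, and names are invented for illustration and say nothing about Cloudflare’s actual change tooling:

```python
# Hypothetical pre-deployment guard: refuse any topology change that would
# bind a protected prefix to a service other than its registered owner.
# Prefixes, service names, and the change format are invented for this sketch.
from ipaddress import ip_network

PROTECTED = {
    ip_network("1.1.1.0/24"): "public-resolver",
    ip_network("1.0.0.0/24"): "public-resolver",
}


def violations(change: dict) -> list[str]:
    """Return human-readable reasons to reject the change, if any."""
    problems = []
    for prefix_str in change.get("prefixes", []):
        prefix = ip_network(prefix_str)
        for protected, owner in PROTECTED.items():
            if prefix.overlaps(protected) and change["service"] != owner:
                problems.append(
                    f"{prefix} overlaps protected {protected} "
                    f"(owned by {owner!r}, change targets {change['service']!r})"
                )
    return problems


if __name__ == "__main__":
    proposed = {"service": "internal-testing", "prefixes": ["1.1.1.0/24"]}
    for reason in violations(proposed):
        print("REJECT:", reason)
```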
Routing/BGP side note and anycast concerns
- An unrelated BGP origin hijack of 1.1.1.0/24 became visible when Cloudflare withdrew its routes, confusing observers who initially blamed it for the outage.
- Discussion touches on RPKI (invalid routes treated as “less preferred” rather than rejected) and the complexities of anycast: it improves latency but can obscure cache behavior and tie multiple “independent” IPs to the same failure domain.
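For anyone who wants to check the RPKI side themselves, something like the following works against the public RIPEstat data API; the endpoint and field names are my reading of its documentation, so treat them as assumptions and inspect the returned JSON rather than trusting the exact keys:

```python
# Sketch: ask RIPEstat whether a given origin AS is RPKI-valid for a prefix.
# Endpoint and field names are assumptions based on the public RIPEstat API.
import json
import urllib.request

PREFIX = "1.1.1.0/24"
ORIGIN_AS = "AS13335"  # Cloudflare

url = (
    "https://stat.ripe.net/data/rpki-validation/data.json"
    f"?resource={ORIGIN_AS}&prefix={PREFIX}"
)
with urllib.request.urlopen(url, timeout=10) as resp:
    payload = json.load(resp)

data = payload.get("data", {})
print("RPKI status for", ORIGIN_AS, PREFIX, "->", data.get("status"))
# An origin hijack would show up as "invalid" for the hijacker's AS, but as
# the thread notes, many networks only deprefer (rather than drop) invalids.
```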