Cloudflare 1.1.1.1 Incident on July 14, 2025

Impact and user experience

  • Some users never noticed the outage because they used DoH (often via cloudflare-dns.com), multi-provider setups, or local resolvers.
  • Others discovered DNS was broken before Cloudflare’s status page acknowledged the incident, and permanently switched to Google or other providers.
  • A few felt burned: they had just moved to 1.1.1.1 after ISP DNS issues and now see public resolvers as less reliable overall.
  • Several point out that traffic not fully returning to pre-incident levels likely reflects client-side caching and users who never switched back.

Redundancy, “backup DNS”, and client behavior

  • Many assumed 1.0.0.1 is an independent backup for 1.1.1.1; discussion clarifies both are the same anycast service and were taken down together.
  • Multiple commenters stress that “secondary DNS” is often not true failover: clients may round-robin, have buggy behavior, or mark servers “down” for a while after timeouts.
  • Recommendation from many: mix different providers (e.g., 1.1.1.1 + 8.8.8.8 or Quad9), ideally fronted by a local caching/forwarding resolver that can race or health‑check upstreams.
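
A minimal sketch of the racing idea, using the third-party dnspython library; the upstream addresses and timeout values are illustrative choices, and a real local forwarder (Unbound, dnsdist, and the like) would add caching and health checks on top of this:

```python
# Race one query against several independent upstream resolvers and take the
# first clean answer. Requires the third-party dnspython package.
from concurrent.futures import ThreadPoolExecutor, as_completed

import dns.message
import dns.query
import dns.rcode

# Deliberately mixed operators so one provider's outage cannot take out every upstream.
UPSTREAMS = ["1.1.1.1", "8.8.8.8", "9.9.9.9"]


def query_one(server: str, qname: str, rdtype: str = "A"):
    """Send a single UDP query to one upstream and return (server, response)."""
    request = dns.message.make_query(qname, rdtype)
    return server, dns.query.udp(request, server, timeout=2.0)


def race(qname: str, rdtype: str = "A"):
    """Return the first NOERROR answer from whichever upstream responds first."""
    pool = ThreadPoolExecutor(max_workers=len(UPSTREAMS))
    try:
        futures = [pool.submit(query_one, server, qname, rdtype) for server in UPSTREAMS]
        for future in as_completed(futures):
            try:
                server, response = future.result()
            except Exception:
                continue  # this upstream timed out or errored; wait for the others
            if response.rcode() == dns.rcode.NOERROR:
                return server, response
        raise RuntimeError("all upstreams failed")
    finally:
        pool.shutdown(wait=False)  # don't block on the slower probes


if __name__ == "__main__":
    server, response = race("example.com")
    print(f"answered by {server}")
    for rrset in response.answer:
        print(rrset)
```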

Cloudflare vs other resolvers (privacy, performance, policy)

  • Debate over whom to trust: Cloudflare vs Google vs ISPs vs Quad9/OpenDNS/dns0/etc.
  • Arguments for big public resolvers: usually faster, often less censorship than ISPs, well-documented privacy policies.
  • Arguments against: US jurisdiction, prior privacy controversies, possible logging/telemetry; some prefer local ISPs regulated under national law or European‑run services.
  • Quad9’s blocking and telemetry policies draw criticism from site operators hit by over‑broad blocking; others see that as an acceptable trade‑off for the filtering it provides.

Running your own resolver and local setups

  • Strong theme: run your own recursive or caching resolver (Unbound, dnsmasq, dnsdist, Technitium, Pi‑hole + Unbound) to reduce dependence on any single provider and improve privacy.
  • Some report poor latencies when resolving directly from remote authoritative servers (especially in NZ); others say the overhead is negligible compared to web page bloat (see the timing sketch after this list).
  • Various recipes shared: racing upstreams, DoT‑only forwarders, mixing filtered/unfiltered resolvers, and careful interleaving of upstream servers for systemd‑resolved.
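
To make the latency debate concrete, a small timing sketch (again assuming dnspython, and a local resolver listening on 127.0.0.1; the resolver set and test name are placeholders for your own setup):

```python
# Time the same lookup against a local resolver and two public ones.
# Requires dnspython; assumes a local resolver (e.g. Unbound) is listening on 127.0.0.1.
import time

import dns.message
import dns.query

RESOLVERS = {
    "local (127.0.0.1)": "127.0.0.1",
    "Cloudflare": "1.1.1.1",
    "Google": "8.8.8.8",
}


def timed_lookup(server: str, qname: str, rdtype: str = "A") -> float:
    """Return wall-clock time in milliseconds for one UDP query to `server`."""
    request = dns.message.make_query(qname, rdtype)
    start = time.perf_counter()
    dns.query.udp(request, server, timeout=3.0)
    return (time.perf_counter() - start) * 1000


if __name__ == "__main__":
    for label, address in RESOLVERS.items():
        try:
            # The first query may be a cache miss (a local recursive resolver has to
            # walk down from the roots); the second shows cached latency.
            cold = timed_lookup(address, "example.org")
            warm = timed_lookup(address, "example.org")
            print(f"{label:20s} cold {cold:7.1f} ms   warm {warm:7.1f} ms")
        except Exception as exc:
            print(f"{label:20s} failed: {exc}")
```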

DoH/DoT behavior and limitations

  • DoH is usually configured by hostname, which itself must be resolved via some other DNS, creating a bootstrap dependency (illustrated in the sketch after this list).
  • Many platforms (Android, some routers, Windows DoH auto-config) only support a single DoH provider or have awkward fallback semantics, undermining real redundancy.
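
A short illustration of the bootstrap dependency, using the third-party requests library against Cloudflare's JSON DoH endpoint at cloudflare-dns.com/dns-query; note that step one is an ordinary DNS lookup of the DoH hostname via whatever plain resolver the OS already has:

```python
# Illustrate the DoH bootstrap dependency: before any DNS-over-HTTPS query can be
# sent, the DoH endpoint's own hostname has to be resolved by some other means.
import socket

import requests  # third-party HTTP client

DOH_HOST = "cloudflare-dns.com"

# Step 1: the bootstrap lookup. This goes through whatever plain DNS the OS is
# configured with; if that resolver is down, DoH never gets off the ground unless
# the client ships a hard-coded bootstrap IP for the endpoint.
bootstrap_ips = {info[4][0] for info in socket.getaddrinfo(DOH_HOST, 443)}
print(f"bootstrap lookup of {DOH_HOST}: {sorted(bootstrap_ips)}")

# Step 2: the actual DoH query, here via Cloudflare's JSON API.
resp = requests.get(
    f"https://{DOH_HOST}/dns-query",
    params={"name": "example.com", "type": "A"},
    headers={"accept": "application/dns-json"},
    timeout=5,
)
resp.raise_for_status()
for answer in resp.json().get("Answer", []):
    print(answer["name"], answer["data"])
```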

Cloudflare’s RCA, monitoring, and engineering practices

  • Root cause as understood from the post: a misconfiguration in a legacy deployment/topology system that incorrectly associated 1.1.1.1/1.0.0.1 with a non‑production service, then propagated globally.
  • Some praise the transparency and technical detail; others dislike the “legacy/strategic system” corporatese and want crisper plain language.
  • Significant discussion around the ~5–8 minute detection delay: some think that’s unacceptably slow for a global resolver; operators counter that tighter thresholds cause intolerable false positives at this scale (a toy model of that trade-off is sketched after this list).
  • Several call for stronger safeguards (e.g., hard‑blocking reassignment of key prefixes, better staged rollouts, clearer separation of failure domains for the two IPs, more central change management).
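
The threshold argument comes down to how many consecutive failed probes you require before alerting. A toy back-of-the-envelope model (every number is a made-up assumption, not Cloudflare's actual monitoring configuration) of how that choice trades detection delay against spurious alerts:

```python
# Toy model of the detection-delay vs. false-positive trade-off for a health check
# that alerts after N consecutive probe failures. Every number here is an
# illustrative assumption, not an actual Cloudflare monitoring parameter.

PROBE_INTERVAL_S = 30          # how often a vantage point probes the resolver
FLAKE_RATE = 0.01              # chance that a single probe fails spuriously
PROBES_PER_DAY = 86_400 // PROBE_INTERVAL_S

for n_consecutive in (1, 2, 4, 8, 16):
    # Worst-case time from "really down" to "alert fires".
    detection_delay_s = n_consecutive * PROBE_INTERVAL_S
    # Rough expected number of spurious alerts per day from one vantage point,
    # treating each window of N probes as independent.
    false_alarms_per_day = PROBES_PER_DAY * (FLAKE_RATE ** n_consecutive)
    print(
        f"N={n_consecutive:2d}  detection delay ~{detection_delay_s:4d}s  "
        f"spurious alerts/day ~{false_alarms_per_day:.2e}"
    )
```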

Routing/BGP side note and anycast concerns

  • An unrelated BGP origin hijack of 1.1.1.0/24 became visible when Cloudflare withdrew its routes, confusing observers who initially blamed it for the outage.
  • Discussion touches on RPKI (some networks merely treat invalid routes as “less preferred” rather than rejecting them) and the complexities of anycast: it improves latency but can obscure cache behavior and tie multiple “independent” IPs to the same failure domain.
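
For readers wondering how an RPKI-invalid hijack can remain visible at all, a minimal sketch of route-origin validation in the spirit of RFC 6811; the VRP entry and origin ASNs are illustrative. Validation only labels a route valid, invalid, or not-found; whether "invalid" means reject or merely deprefer is each network's local policy.

```python
# Minimal route-origin validation sketch in the spirit of RFC 6811. The VRP table
# and the origin ASNs used below are illustrative; real validators fetch signed
# ROAs from the RPKI repositories.
from ipaddress import ip_network

# Validated ROA Payloads: (covering prefix, max length, authorized origin ASN).
VRPS = [
    (ip_network("1.1.1.0/24"), 24, 13335),  # AS13335 = Cloudflare
]


def validate(prefix: str, origin_asn: int) -> str:
    """Classify an announcement as 'valid', 'invalid', or 'not-found'."""
    announced = ip_network(prefix)
    covered = False
    for vrp_prefix, max_len, asn in VRPS:
        if announced.subnet_of(vrp_prefix):
            covered = True
            if announced.prefixlen <= max_len and asn == origin_asn:
                return "valid"
    # Covered by some VRP but no (ASN, length) match => invalid; otherwise unknown.
    return "invalid" if covered else "not-found"


if __name__ == "__main__":
    print(validate("1.1.1.0/24", 13335))    # valid: the legitimate origin
    print(validate("1.1.1.0/24", 64512))    # invalid: an unauthorized origin (hijack-style)
    print(validate("192.0.2.0/24", 64512))  # not-found: no covering ROA at all
    # Whether "invalid" routes get dropped or merely depreferred is a per-network
    # routing-policy decision, which is why an invalid hijack can stay visible.
```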