Cloudflare 1.1.1.1 Incident on July 14, 2025
Impact and user experience
- Some users never noticed the outage because they used DoH (often via cloudflare-dns.com), multi-provider setups, or local resolvers.
- Others discovered DNS was broken before Cloudflare’s status page acknowledged it and permanently switched to Google or other providers.
- A few felt burned: they had just moved to 1.1.1.1 after ISP DNS issues and now see public resolvers as less reliable overall.
- Several point out that traffic not fully returning to prior levels likely reflects client caching and users who never switched back.
Redundancy, “backup DNS”, and client behavior
- Many assumed 1.0.0.1 is an independent backup for 1.1.1.1; discussion clarifies both are the same anycast service and were taken down together.
- Multiple commenters stress that “secondary DNS” is often not true failover: clients may round-robin, have buggy behavior, or mark servers “down” for a while after timeouts.
- Recommendation from many: mix different providers (e.g., 1.1.1.1 + 8.8.8.8 or Quad9), ideally fronted by a local caching/forwarding resolver that can race or health‑check upstreams.
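As a rough illustration of the “race several upstreams” idea, here is a minimal Python sketch using dnspython; the provider IPs, timeouts, and structure are illustrative assumptions rather than anything prescribed in the thread, and a real local forwarder (Unbound, dnsdist, etc.) adds caching and health checks on top:

```python
# Minimal sketch: send the same query to resolvers run by different
# operators and take the first answer that comes back. Purely illustrative;
# a production forwarder also caches, retries, and health-checks upstreams.
from concurrent.futures import ThreadPoolExecutor, as_completed

import dns.message
import dns.query

# Deliberately mixed providers so one operator's outage is survivable.
UPSTREAMS = ["1.1.1.1", "8.8.8.8", "9.9.9.9"]


def ask(server: str, qname: str) -> dns.message.Message:
    query = dns.message.make_query(qname, "A")
    return dns.query.udp(query, server, timeout=2.0)


def resolve_racing(qname: str) -> dns.message.Message:
    with ThreadPoolExecutor(max_workers=len(UPSTREAMS)) as pool:
        futures = {pool.submit(ask, server, qname): server for server in UPSTREAMS}
        for future in as_completed(futures):
            try:
                return future.result()  # first successful answer wins
            except Exception:
                continue  # that upstream timed out or errored; keep waiting
    raise RuntimeError("all upstreams failed")


if __name__ == "__main__":
    for rrset in resolve_racing("example.com").answer:
        print(rrset)
```

The point is simply that the upstreams belong to different operators, so a repeat of this incident would cost one answer path rather than all of them.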
Cloudflare vs other resolvers (privacy, performance, policy)
- Debate over whom to trust: Cloudflare vs Google vs ISPs vs Quad9/OpenDNS/dns0/etc.
- Arguments for big public resolvers: usually faster, often less censorship than ISPs, well-documented privacy policies.
- Arguments against: US jurisdiction, prior privacy controversies, possible logging/telemetry; some prefer local ISPs regulated under national law or European‑run services.
- Quad9’s blocking and telemetry policies draw criticism from site operators hit by over‑broad blocking; others see that as acceptable for filtering.
Running your own resolver and local setups
- Strong theme: run your own recursive resolver (Unbound, dnsmasq, dnsdist, Technitium, Pi‑hole + Unbound) to reduce dependence on any single provider and improve privacy.
- Some report poor latencies when resolving directly from remote authoritative servers (especially from New Zealand), while others say the overhead is negligible compared to web-page bloat.
- Various recipes shared: racing upstreams, DoT‑only forwarders, mixing filtered/unfiltered resolvers, and careful interleaving for systemd‑resolved.
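One of those recipes, a DoT-only lookup with upstreams from more than one operator, can be prototyped in a few lines. This sketch uses dnspython’s TLS query support; the Cloudflare and Quad9 endpoints are chosen purely as examples:

```python
# Sketch of a DoT-only lookup with provider fallback: try each upstream
# over TLS on port 853, authenticating the server by its TLS hostname.
import dns.message
import dns.query

# (ip, tls_auth_name) pairs for two different operators; illustrative only.
DOT_UPSTREAMS = [
    ("1.1.1.1", "cloudflare-dns.com"),
    ("9.9.9.9", "dns.quad9.net"),
]


def resolve_dot(qname: str, rdtype: str = "A") -> dns.message.Message:
    query = dns.message.make_query(qname, rdtype)
    last_error = None
    for ip, auth_name in DOT_UPSTREAMS:
        try:
            # server_hostname drives both SNI and certificate validation.
            return dns.query.tls(query, ip, timeout=3.0, server_hostname=auth_name)
        except Exception as exc:
            last_error = exc  # fall through to the next provider
    raise RuntimeError(f"all DoT upstreams failed: {last_error}")


if __name__ == "__main__":
    for rrset in resolve_dot("example.org").answer:
        print(rrset)
```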
DoH/DoT behavior and limitations
- DoH is usually configured by hostname, which must itself be resolved via some other DNS, creating a bootstrap dependency (see the sketch after this list).
- Many platforms (Android, some routers, Windows DoH auto-config) only support a single DoH provider or have awkward fallback semantics, undermining real redundancy.
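A small sketch of that bootstrap dependency, using the standard library plus dnspython for the wire format (the hostname and example query are arbitrary): the DoH server’s own name has to be resolved by classic DNS before any encrypted query can be sent.

```python
# Sketch of the DoH bootstrap problem: a DoH endpoint named by hostname
# must itself be resolved with ordinary DNS before any DoH query works.
import base64
import socket
import urllib.request

import dns.message

DOH_HOST = "cloudflare-dns.com"

# Step 1: the bootstrap resolution, done over plain old DNS.
bootstrap = socket.getaddrinfo(DOH_HOST, 443, type=socket.SOCK_STREAM)
print("bootstrap resolution of", DOH_HOST, "->",
      sorted({ai[4][0] for ai in bootstrap}))

# Step 2: only now can we send an RFC 8484 GET query to the DoH server.
wire = dns.message.make_query("example.com", "A").to_wire()
param = base64.urlsafe_b64encode(wire).rstrip(b"=").decode()
req = urllib.request.Request(
    f"https://{DOH_HOST}/dns-query?dns={param}",
    headers={"Accept": "application/dns-message"},
)
with urllib.request.urlopen(req, timeout=5) as resp:
    answer = dns.message.from_wire(resp.read())
for rrset in answer.answer:
    print(rrset)
```

Clients typically paper over this by shipping bootstrap IPs for well-known providers or by falling back to plain DNS, which is part of why genuine multi-provider DoH redundancy is awkward in practice.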
Cloudflare’s RCA, monitoring, and engineering practices
- Root cause as understood from the post: a misconfiguration in a legacy deployment/topology system that incorrectly associated 1.1.1.1/1.0.0.1 with a non‑production service, then propagated globally.
- Some praise the transparency and technical detail; others dislike the “legacy/strategic system” corporatese and want crisper plain language.
- Significant discussion around the ~5–8 minute detection delay: some think that’s unacceptably slow for a global resolver; operators counter that tighter thresholds cause intolerable false positives at this scale.
- Several call for stronger safeguards (e.g., hard‑blocking reassignment of key prefixes, better staged rollouts, clearer separation of failure domains for the two IPs, more central change management).
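To make the “hard-block reassignment of key prefixes” suggestion concrete, here is a hypothetical pre-deployment guard; the prefix list, change format, and names are invented for illustration and say nothing about Cloudflare’s actual change tooling:

```python
# Hypothetical pre-deployment guard: refuse any topology change that would
# bind a protected prefix to a service other than its registered owner.
# Prefixes, service names, and the change format are invented for this sketch.
from ipaddress import ip_network

PROTECTED = {
    ip_network("1.1.1.0/24"): "public-resolver",
    ip_network("1.0.0.0/24"): "public-resolver",
}


def violations(change: dict) -> list[str]:
    """Return human-readable reasons to reject the change, if any."""
    problems = []
    for prefix_str in change.get("prefixes", []):
        prefix = ip_network(prefix_str)
        for protected, owner in PROTECTED.items():
            if prefix.overlaps(protected) and change["service"] != owner:
                problems.append(
                    f"{prefix} overlaps protected {protected} "
                    f"(owned by {owner!r}, change targets {change['service']!r})"
                )
    return problems


if __name__ == "__main__":
    proposed = {"service": "internal-testing", "prefixes": ["1.1.1.0/24"]}
    for reason in violations(proposed):
        print("REJECT:", reason)
```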
Routing/BGP side note and anycast concerns
- An unrelated BGP origin hijack of 1.1.1.0/24 became visible when Cloudflare withdrew its routes, confusing observers who initially blamed it for the outage.
- Discussion touches on RPKI (invalid routes treated as “less preferred” rather than rejected) and the complexities of anycast: it improves latency but can obscure cache behavior and tie multiple “independent” IPs to the same failure domain.
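For anyone who wants to check the RPKI side themselves, something like the following works against the public RIPEstat data API; the endpoint and field names are my reading of its documentation, so treat them as assumptions and inspect the returned JSON rather than trusting the exact keys:

```python
# Sketch: ask RIPEstat whether a given origin AS is RPKI-valid for a prefix.
# Endpoint and field names are assumptions based on the public RIPEstat API.
import json
import urllib.request

PREFIX = "1.1.1.0/24"
ORIGIN_AS = "AS13335"  # Cloudflare

url = (
    "https://stat.ripe.net/data/rpki-validation/data.json"
    f"?resource={ORIGIN_AS}&prefix={PREFIX}"
)
with urllib.request.urlopen(url, timeout=10) as resp:
    payload = json.load(resp)

data = payload.get("data", {})
print("RPKI status for", ORIGIN_AS, PREFIX, "->", data.get("status"))
# An origin hijack would show up as "invalid" for the hijacker's AS, but as
# the thread notes, many networks only deprefer (rather than drop) invalids.
```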