Understanding Round Robin DNS

Round Robin DNS vs Load Balancers

  • Many argue that DNS round robin (RR-DNS) is not a real load balancer: DNS only maps names to IP addresses, and once the addresses are handed out, which one gets used depends entirely on the client and resolver.
  • Critics say RR-DNS is inadequate for high availability, failover control, and geographic routing; dedicated L4/L7 load balancers or anycast/BGP are preferred where reliability matters.
  • Others counter that RR-DNS can be a pragmatic choice, especially when load balancers are too costly or complex, and that “perfect” reliability is not required for all services.

Client Behavior, Caching, and TTLs

  • Reliability and failover heavily depend on how clients and intermediate resolvers handle:
    • TTLs (some resolvers clamp TTLs to a minimum such as 1 hour; some clients ignore TTLs entirely).
    • Multiple A records (some clients always pick the lowest-numbered IP; some don’t retry another address on failure).
    • Timeouts (a refused connection fails fast, while a silent hang waits out the full timeout, so the user experience differs sharply).
  • Browsers are generally described as “good enough”: they try multiple IPs and fail over quickly; many non-browser or legacy clients are described as buggy or overly cache-happy (e.g., older Java, some Go HTTP/2 / gRPC behavior, embedded devices).
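
The "try multiple IPs and fail over quickly" behavior attributed to browsers can be approximated in a client you control. The following is a minimal sketch, not any particular browser's algorithm: the names `resolve_all` and `connect_with_failover`, the 2-second default timeout, and the injectable `connect` parameter are all assumptions made for illustration.

```python
import socket
from typing import Callable, Iterable, List, Tuple

Address = Tuple[str, int]


def resolve_all(host: str, port: int) -> List[Address]:
    """Return every distinct address the resolver hands back for host."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    seen: List[Address] = []
    for _family, _type, _proto, _canon, sockaddr in infos:
        addr = (sockaddr[0], sockaddr[1])
        if addr not in seen:
            seen.append(addr)
    return seen


def connect_with_failover(
    addresses: Iterable[Address],
    timeout: float = 2.0,  # assumed value; short so a silent hang fails over quickly
    connect: Callable = socket.create_connection,
):
    """Try each address in order and return the first successful connection."""
    last_error = None
    for addr in addresses:
        try:
            return connect(addr, timeout)
        except OSError as exc:  # covers refused connections and timeouts
            last_error = exc
    raise ConnectionError(f"all addresses failed: {last_error}")
```

A caller would typically write `connect_with_failover(resolve_all("example.com", 443))`. The short timeout matters most for the silent-hang case, since a refused connection already fails immediately.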

Use Cases and Context

  • Several commenters note RR-DNS is acceptable or even excellent when:
    • You control clients and can implement smart retry / fallback IP logic.
    • The service can tolerate occasional or slow failover (e.g., free/public APIs, internal systems, training environments, SSH endpoints).
  • For e‑commerce or revenue-critical services, even a small fraction of users failing due to DNS quirks is seen as unacceptable, and such failures are hard to measure.

Anycast, GeoDNS, and Cloud Providers

  • Large CDNs successfully use DNS-based approaches, often combined with anycast and sophisticated geo-routing.
  • Anycast is seen as ideal but out of reach for many small operators due to BGP, IP space, and operational complexity.
  • Some build custom DNS backends (e.g., with PowerDNS) that do health checks, weighted/geo RR, and failover at the DNS layer, often with low TTLs (~30–60s).
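
The answer-selection step of such a health-checked, weighted backend might look roughly like the sketch below. It is not PowerDNS's API: the `Backend` type, the `pick_answers` function, and the fail-open choice (return every record when no backend is healthy) are assumptions made for illustration, and the health flag is assumed to be maintained by a separate checker.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Backend:
    ip: str
    weight: int    # relative share of answers this backend should receive
    healthy: bool  # assumed to be updated by an external health checker


def pick_answers(
    backends: List[Backend],
    count: int = 2,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Choose up to `count` IPs for one DNS answer, skipping unhealthy backends."""
    if rng is None:
        rng = random.Random()
    pool = [b for b in backends if b.healthy and b.weight > 0]
    if not pool:
        # Assumed policy: fail open instead of returning an empty answer.
        pool = list(backends)
    chosen: List[str] = []
    while pool and len(chosen) < count:
        # Weighted draw without replacement; max() guards the fail-open case
        # where a backend may carry a zero weight.
        weights = [max(b.weight, 1) for b in pool]
        pick = rng.choices(pool, weights=weights, k=1)[0]
        chosen.append(pick.ip)
        pool.remove(pick)
    return chosen
```

With a low TTL on the returned records, an unhealthy backend drops out of new answers within roughly one TTL, subject to the resolver and client caching caveats described earlier.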

Cloudflare-Specific Behavior

  • Discussion highlights Cloudflare’s DNS+proxy model, its load balancing product (with monitors, affinity, and failover), and a feature gap between free and paid plans regarding “zero downtime failover.”
  • After the thread, Cloudflare changed its behavior so that free accounts also get automatic failover behind its proxy, and the original tests are reported to pass.

Alternative DNS-Based Approaches

  • SRV records are discussed as a better-designed mechanism (priority/weight), but HTTP never officially adopted them.
  • New HTTPS/SVCB records offer SRV-like functionality plus TLS/bootstrap benefits and better handling of apex domains; adoption is still emerging.
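
To make the SRV priority/weight idea concrete, here is a simplified sketch of how a client could order SRV targets: lower priority values are tried first, and within one priority level targets are drawn at random in proportion to their weights. It follows the spirit of RFC 2782 rather than its exact selection procedure (zero-weight records are simply tried last within their level), and `SrvRecord` and `order_srv` are hypothetical names.

```python
import random
from typing import List, NamedTuple, Optional


class SrvRecord(NamedTuple):
    priority: int  # lower value is preferred
    weight: int    # relative share within the same priority
    port: int
    target: str


def order_srv(
    records: List[SrvRecord], rng: Optional[random.Random] = None
) -> List[SrvRecord]:
    """Return records in the order a client should attempt them."""
    if rng is None:
        rng = random.Random()
    ordered: List[SrvRecord] = []
    for priority in sorted({r.priority for r in records}):
        group = [r for r in records if r.priority == priority]
        while group:
            total = sum(r.weight for r in group)
            if total == 0:
                # Only zero-weight records remain: pick uniformly.
                pick = rng.choice(group)
            else:
                pick = rng.choices(
                    group, weights=[r.weight for r in group], k=1
                )[0]
            ordered.append(pick)
            group.remove(pick)
    return ordered
```

Unlike plain round robin over A records, this gives the operator explicit control over both the traffic split (weights) and the failover order (priorities), provided the client implements it.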