2024-10-26

Understanding Round Robin DNS

Round Robin DNS vs Load Balancers

Many argue that DNS round robin (RR-DNS) is not a real load balancer: DNS’s job is only name→IP resolution, and once IPs are handed out, behavior depends entirely on the client and resolver.
Critics say RR-DNS is inadequate for high availability, failover control, and geographic routing; dedicated L4/L7 load balancers or anycast/BGP are preferred where reliability matters.
Others counter that RR-DNS can be a pragmatic choice, especially when load balancers are too costly or complex, and that “perfect” reliability is not required for all services.

Client Behavior, Caching, and TTLs

Reliability and failover depend heavily on how clients and intermediate resolvers handle:
- TTLs (some resolvers clamp to minimums like 1 hour; some clients ignore TTLs entirely).
- Multiple A records (some always pick the numerically lowest IP; some don’t retry another address on failure).
- Timeouts (a refused connection fails fast, while a silent hang leaves the user waiting for a timeout, so the user experience differs sharply).
Browsers are generally described as “good enough”: they try multiple IPs and fail over quickly. Many non-browser or legacy clients, by contrast, are described as buggy or overly cache-happy (e.g., older Java, some Go HTTP/2 / gRPC behavior, embedded devices).
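
A minimal Python sketch of the client-side failover described above: resolve every address for a name, then try each in turn with a short timeout. The names `resolve_all` and `connect_with_fallback`, the 2-second timeout, and the injectable `connect` parameter are assumptions made for illustration; they do not come from the discussion.

```python
import socket


def resolve_all(host, port):
    """Return every distinct IP address the resolver hands back for host."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    addresses = []
    for _family, _type, _proto, _canon, sockaddr in infos:
        if sockaddr[0] not in addresses:
            addresses.append(sockaddr[0])
    return addresses


def connect_with_fallback(addresses, port, timeout=2.0,
                          connect=socket.create_connection):
    """Try each address in order; return (connection, address) for the first success.

    A short timeout keeps a silently hanging address from stalling the caller,
    while a refused connection fails immediately and moves on to the next one.
    """
    errors = []
    for addr in addresses:
        try:
            return connect((addr, port), timeout=timeout), addr
        except OSError as exc:  # refused, unreachable, timed out, ...
            errors.append((addr, exc))
    raise ConnectionError(f"all {len(addresses)} addresses failed: {errors}")


# Hypothetical usage:
# sock, addr = connect_with_fallback(resolve_all("example.com", 443), 443)
```

Note that `socket.create_connection` already iterates over resolved addresses when given a hostname; the explicit loop here only makes the retry order and timeout visible.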

Use Cases and Context

Several commenters note RR-DNS is acceptable or even excellent when:
- You control clients and can implement smart retry / fallback IP logic.
- The service can tolerate occasional or slow failover (e.g., free/public APIs, internal systems, training environments, SSH endpoints).
For e‑commerce or revenue-critical services, even a small fraction of users failing due to DNS quirks is seen as unacceptable and hard to measure.

Anycast, GeoDNS, and Cloud Providers

Large CDNs successfully use DNS-based approaches, often combined with anycast and sophisticated geo-routing.
Anycast is seen as ideal but out of reach for many small operators due to BGP, IP space, and operational complexity.
Some build custom DNS backends (e.g., with PowerDNS) that do health checks, weighted/geo RR, and failover on the DNS layer, often with low TTLs (~30–60s).
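
The selection logic inside such a backend can be sketched in Python as follows. This is not PowerDNS code or its API: the function `pick_records`, the `backends` weight map, and the fail-open behavior when every health check fails are all assumptions chosen for illustration.

```python
import random


def pick_records(backends, healthy, count=2, rng=random):
    """Choose up to `count` IPs to return in a DNS answer.

    backends: dict mapping IP -> integer weight (hypothetical config).
    healthy:  set of IPs currently passing health checks.
    """
    pool = {ip: w for ip, w in backends.items() if ip in healthy and w > 0}
    if not pool:
        # Assumption: fail open and answer with all backends rather than
        # returning an empty response when every health check fails.
        pool = {ip: w for ip, w in backends.items() if w > 0}

    chosen = []
    while pool and len(chosen) < count:
        ips = list(pool)
        # Weighted draw without replacement: heavier backends appear more often.
        ip = rng.choices(ips, weights=[pool[i] for i in ips], k=1)[0]
        chosen.append(ip)
        del pool[ip]
    return chosen
```

With a low TTL on the answer, an unhealthy backend drops out of responses after roughly one TTL interval, subject to the resolver and client caching caveats noted earlier.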

Cloudflare-Specific Behavior

The discussion highlights Cloudflare’s DNS+proxy model, its load balancing product (with monitors, affinity, and failover), and a feature gap between free and paid plans regarding “zero downtime failover.”
After the thread, Cloudflare changed its behavior so that free accounts also get automatic failover behind its proxy, and the original tests are reported to pass.

Alternative DNS-Based Approaches

SRV records are discussed as a better-designed mechanism (priority/weight), but HTTP never officially adopted them.
New HTTPS/SVCB records offer SRV-like functionality plus TLS/bootstrap benefits and better handling of apex domains; adoption status is still emerging.
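
To make the SRV priority/weight idea concrete, here is a simplified Python sketch of how a client might order SRV targets: lower priority values are tried first, and within one priority level targets are drawn at random in proportion to weight. It only approximates the selection procedure in RFC 2782 (zero-weight handling in particular is simplified), and `SrvRecord` and `order_srv` are hypothetical names.

```python
import random
from collections import namedtuple

SrvRecord = namedtuple("SrvRecord", "priority weight port target")


def order_srv(records, rng=random):
    """Return records in the order a client would attempt them."""
    ordered = []
    for priority in sorted({r.priority for r in records}):
        group = [r for r in records if r.priority == priority]
        while group:
            weights = [r.weight for r in group]
            if sum(weights) == 0:
                # Simplification: all-zero weights are treated as uniform.
                pick = rng.choice(group)
            else:
                pick = rng.choices(group, weights=weights, k=1)[0]
            ordered.append(pick)
            group.remove(pick)
    return ordered
```

A client would then connect to the targets in the returned order, falling through to higher priority values only after every lower one has failed.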
