Understanding Round Robin DNS
Round Robin DNS vs Load Balancers
- Many argue DNS round robin (RR-DNS) is fundamentally not a real load balancer: DNS’s job is only name→IP, and once IPs are handed out, behavior is entirely client- and resolver-dependent.
- Critics say RR-DNS is inadequate for high availability, failover control, and geographic routing; dedicated L4/L7 load balancers or anycast/BGP are preferred where reliability matters.
- Others counter that RR-DNS can be a pragmatic choice, especially when load balancers are too costly or complex, and that “perfect” reliability is not required for all services.
Client Behavior, Caching, and TTLs
- Reliability and failover depend heavily on how clients and intermediate resolvers handle:
  - TTLs (some resolvers clamp them to a minimum such as 1 hour; some clients ignore TTLs entirely).
  - Multiple A records (some clients always pick the lowest IP; some don’t retry another address on failure).
  - Timeouts (a refused connection and a silent hang yield very different user experiences).
- Browsers are generally described as “good enough”: they try multiple IPs and fail over quickly; many non-browser or legacy clients are described as buggy or overly cache-happy (e.g., older Java, some Go HTTP/2 / gRPC behavior, embedded devices).
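The "try multiple IPs and fail over quickly" behavior attributed to browsers can be approximated in a client you control. The following is a minimal sketch, not any particular browser's or library's algorithm: the function names `resolve_all` and `connect_with_fallback`, the 2-second timeout, and the injectable `connector` parameter are all illustrative assumptions.

```python
import socket
from typing import Callable, Iterable, Optional


def resolve_all(host: str, port: int) -> list[str]:
    """Return every distinct IP address the resolver hands back for host."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    seen: list[str] = []
    for _family, _type, _proto, _canon, sockaddr in infos:
        ip = sockaddr[0]
        if ip not in seen:
            seen.append(ip)
    return seen


def connect_with_fallback(
    addresses: Iterable[str],
    port: int,
    timeout: float = 2.0,  # assumed value; short so a silent hang fails over quickly
    connector: Callable[..., object] = socket.create_connection,
):
    """Try each address in order; return (ip, connection) for the first success."""
    last_error: Optional[OSError] = None
    for ip in addresses:
        try:
            return ip, connector((ip, port), timeout=timeout)
        except OSError as exc:  # covers refused connections and timeouts
            last_error = exc
    raise ConnectionError(f"all addresses failed: {last_error}")
```

A caller would typically pass `resolve_all("example.com", 443)` as `addresses`. The explicit timeout is what bounds the "silent hang" case; a refused connection raises immediately and moves on to the next address.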
Use Cases and Context
- Several commenters note RR-DNS is acceptable or even excellent when:
  - You control clients and can implement smart retry / fallback IP logic.
  - The service can tolerate occasional or slow failover (e.g., free/public APIs, internal systems, training environments, SSH endpoints).
- For e‑commerce or revenue-critical services, even a small fraction of users failing due to DNS quirks is seen as unacceptable and hard to measure.
Anycast, GeoDNS, and Cloud Providers
- Large CDNs successfully use DNS-based approaches, often combined with anycast and sophisticated geo-routing.
- Anycast is seen as ideal but out of reach for many small operators due to BGP, IP space, and operational complexity.
- Some build custom DNS backends (e.g., with PowerDNS) that do health checks, weighted/geo RR, and failover on the DNS layer, often with low TTLs (~30–60s).
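The answer-selection step of such a health-checked, weighted DNS backend might look roughly like the sketch below. It is not PowerDNS's API: `Backend`, `pick_answers`, the `is_healthy` callback, and the fail-open fallback are hypothetical choices made for illustration, and weights are assumed to be positive integers.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    ip: str
    weight: int  # relative share of answers; assumed to be >= 1


def pick_answers(backends, is_healthy, count=2, rng=random):
    """Return up to `count` healthy IPs, sampled without replacement by weight."""
    pool = [b for b in backends if is_healthy(b)]
    if not pool:
        # Assumption: fail open and serve every backend rather than an empty answer.
        pool = list(backends)
    chosen = []
    while pool and len(chosen) < count:
        pick = rng.choices(pool, weights=[b.weight for b in pool], k=1)[0]
        chosen.append(pick.ip)
        pool.remove(pick)
    return chosen
```

The returned addresses would then be served with a low TTL (the thread mentions roughly 30 to 60 seconds) so that a backend dropped by the health check stops receiving new clients reasonably soon, subject to the resolver and client caching caveats above.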
Cloudflare-Specific Behavior
- Discussion highlights Cloudflare’s DNS+proxy model, its load balancing product (with monitors, affinity, and failover), and a feature gap between free and paid plans regarding “zero downtime failover.”
- After the thread, Cloudflare changed its behavior so that free accounts also get automatic failover behind its proxy, and the original tests are reported to pass.
Alternative DNS-Based Approaches
- SRV records are discussed as a better-designed mechanism (priority/weight), but HTTP never officially adopted them.
- New HTTPS/SVCB records offer SRV-like functionality plus TLS/bootstrap benefits and better handling of apex domains; adoption is still emerging.
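To make the SRV priority/weight idea concrete, here is a rough sketch of how a client could order SRV targets: lowest priority value first, then a weighted random draw within each priority group, in the spirit of the selection procedure in RFC 2782. The `SrvRecord` type, the `order_srv` name, and the hostnames in the usage example are illustrative assumptions, not a standard library API.

```python
import random
from itertools import groupby
from typing import NamedTuple


class SrvRecord(NamedTuple):
    priority: int  # lower value is tried first
    weight: int    # relative share among records of equal priority
    port: int
    target: str


def order_srv(records, rng=random):
    """Order SRV records: lowest priority first, weighted-random within a priority."""
    ordered = []
    by_priority = sorted(records, key=lambda r: r.priority)
    for _prio, group in groupby(by_priority, key=lambda r: r.priority):
        # Zero-weight records go first so they are only picked when the draw is 0.
        remaining = sorted(group, key=lambda r: r.weight != 0)
        while remaining:
            total = sum(r.weight for r in remaining)
            draw = rng.randint(0, total)
            running = 0
            for i, rec in enumerate(remaining):
                running += rec.weight
                if running >= draw:
                    ordered.append(remaining.pop(i))
                    break
    return ordered
```

For example, given two priority-10 records weighted 60 and 40 and one priority-20 record, `order_srv` always places the priority-20 target last, so it acts as a backup; the order of the first two varies from call to call according to their weights.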