Tell HN: Azure outage
Scope and symptoms of the outage
- Reported globally (Europe, APAC, US). First customer impact began around 15:45–16:00 UTC.
- Azure Portal often unreachable or only partially loading; some users could access only a subset of resources.
- Azure Front Door and Azure CDN (azureedge.net) heavily impacted: slow or failing DNS resolutions, intermittent or no A records, origin timeouts.
- Many Microsoft-owned properties affected: microsoft.com, login.microsoftonline.com (Entra/SSO), VS Code site and updater, learn.microsoft.com, xbox.com, minecraft.net.
- Downstream services broke: corporate SSO, Power Apps, Power Platform, GitHub large runners/Codespaces, Playwright browser downloads, winget, Outlook “modern” client, MS Clarity, various banks, airlines (e.g. check‑in), national digital ID systems, public transport planners, ticket machines, retail tills, and parking/payment systems.
- Core compute often still worked: many report that VMs, databases, AKS, App Services not behind Front Door, and Azure DevOps itself remained functional.
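For anyone trying to tell the DNS symptoms above apart from an origin problem, a small probe can help. This is a minimal sketch, not anything posted in the thread: `probe_a_records`, the `slow_after` threshold, and the injectable `resolver` parameter are all hypothetical names chosen for illustration. It only uses `socket.getaddrinfo` from the standard library and classifies a lookup as ok, slow, or failed.

```python
import socket
import time
from typing import Callable


def probe_a_records(
    hostname: str,
    resolver: Callable = socket.getaddrinfo,  # injectable so the probe can be tested offline
    slow_after: float = 1.0,                  # assumed threshold in seconds, tune to taste
) -> dict:
    """Resolve IPv4 addresses for hostname and classify the outcome."""
    started = time.monotonic()
    try:
        results = resolver(hostname, 443, socket.AF_INET, socket.SOCK_STREAM)
    except socket.gaierror as exc:
        # getaddrinfo raises gaierror both for unknown names and for
        # names that currently return no A records.
        elapsed = time.monotonic() - started
        return {"status": "failed", "addresses": [], "seconds": elapsed, "error": str(exc)}
    elapsed = time.monotonic() - started
    # Each result is (family, type, proto, canonname, sockaddr); sockaddr is (ip, port).
    addresses = sorted({info[4][0] for info in results})
    status = "slow" if elapsed > slow_after else "ok"
    return {"status": status, "addresses": addresses, "seconds": elapsed}


if __name__ == "__main__":
    # example.com is a placeholder; substitute the edge hostname you care about.
    print(probe_a_records("example.com"))
```

Running it in a loop from a couple of networks would show whether resolution is intermittent, which is the pattern several commenters described. It says nothing about origin timeouts, which need an HTTP-level check.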
Cause and technical discussion
- Early guesses centered on DNS; initial status messages cited “DNS issues,” later updated to Azure Front Door issues and then an “inadvertent configuration change.”
- Status history describes: bad AFD config deployment, bug in validation safeguards letting it pass, global rollback to “last known good” config, blocked further changes, gradual node recovery.
- Commenters emphasize configuration as the real single point of failure, and note the recurring pattern of “it’s DNS (or BGP).”
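The sequence in the status history (validation gate, bad config slipping through, rollback to last known good, changes blocked) can be sketched in a few lines. This is purely illustrative and assumes nothing about how Azure Front Door is actually built: `ConfigStore`, `require_routes`, and the `ttl` field are hypothetical names invented for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

Config = dict
# A validator returns an error message, or None if the config looks fine.
Validator = Callable[[Config], Optional[str]]


@dataclass
class ConfigStore:
    """Tracks the active config and the last one known to be healthy."""
    active: Config
    last_known_good: Config
    frozen: bool = False  # when True, further changes are blocked
    history: list = field(default_factory=list)

    def deploy(self, candidate: Config, validators: list[Validator]) -> bool:
        if self.frozen:
            self.history.append("rejected: changes are blocked")
            return False
        for check in validators:
            error = check(candidate)
            if error is not None:
                self.history.append(f"rejected: {error}")
                return False
        self.active = candidate
        self.history.append("deployed")
        return True

    def mark_healthy(self) -> None:
        # Call only after the active config has been observed working.
        self.last_known_good = self.active

    def rollback(self) -> None:
        # Restore the last healthy config and block further changes.
        self.active = self.last_known_good
        self.frozen = True
        self.history.append("rolled back to last known good; changes blocked")


def require_routes(config: Config) -> Optional[str]:
    """Hypothetical safeguard: reject configs with no routes."""
    if not config.get("routes"):
        return "config has no routes"
    return None


if __name__ == "__main__":
    good = {"routes": ["/"]}
    store = ConfigStore(active=good, last_known_good=good)
    store.deploy({"routes": []}, [require_routes])                # caught by the safeguard
    store.deploy({"routes": ["/"], "ttl": -1}, [require_routes])  # passes: nothing checks ttl
    store.rollback()
    print(store.history)
```

The second deploy is the interesting one: the safeguard exists but has a gap, so a bad config goes live everywhere at once. That is the commenters' point about configuration being the single point of failure, since the validator and the rollout are shared by every node.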
Front Door reputation
- Multiple teams report prior regional AFD incidents, often unacknowledged in Service Health.
- Complaints include frequent regional outages, slow TLS handshakes, throughput caps, hard 500 ms origin timeout, and even Microsoft marketing content briefly appearing on customer sites.
- Several organizations had already migrated off AFD (often to Cloudflare) and say this outage validates that choice; others now plan to move.
Status page and communication
- Strong criticism that Azure’s public status page stayed green or minimized impact (initially “portal only”) and was slow to update (lagging by 30 minutes or more).
- Some note the irony of status endpoints themselves being down or fronted by the same failing infra.
- Others counter that status pages at hyperscalers often lag because of manual approval and SLA implications; a few contrast this with more transparent smaller providers.
Cloud reliability and strategy debates
- Recent AWS and GCP incidents are frequently referenced; some see this as justification for multi-region or multi-cloud, others say multi-cloud is too complex except at large scale.
- Anecdotes compare hyperscalers unfavorably to smaller VPS hosts and on‑prem setups, though others point out those lack managed services.
- Broader concern that concentrating critical national services (ID, trains, payments) on a single cloud creates highly correlated, society‑wide failure modes.