2025-10-29

Tell HN: Azure outage

Scope and symptoms of the outage

Reported globally (Europe, APAC, US). First customer impact was around 15:45–16:00 UTC.
Azure Portal often unreachable or only partially loading; some users could access only a subset of resources.
Azure Front Door and Azure CDN (azureedge.net) heavily impacted: slow or failing DNS resolution, intermittent or missing A records, origin timeouts.
Many Microsoft-owned properties affected: microsoft.com, login.microsoftonline.com (Entra/SSO), VS Code site and updater, learn.microsoft.com, xbox.com, minecraft.net.
Downstream services broke: corporate SSO, Power Apps, Power Platform, GitHub large runners/Codespaces, Playwright browser downloads, winget, Outlook “modern” client, MS Clarity, various banks, airlines (e.g. check‑in), national digital ID systems, public transport planners, ticket machines, retail tills, and parking/payment systems.
Core compute often still worked: many report that VMs, databases, AKS, App Services without Front Door, and Azure DevOps itself remained functional.
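
The "intermittent or missing A records" symptom is something a small script can watch for from the outside. The following is a minimal sketch, not anything taken from the thread: the function names `resolve_a_records` and `check_hosts` are hypothetical, and the approach simply asks the system resolver for IPv4 addresses and treats any resolution error as "no records".

```python
import socket
from typing import Callable, Dict, Iterable, List


def resolve_a_records(hostname: str,
                      resolver: Callable = socket.getaddrinfo) -> List[str]:
    """Return the sorted unique IPv4 addresses for hostname, or [] on failure.

    `resolver` defaults to the system resolver; it is injectable so the
    function can be exercised without network access.
    """
    try:
        infos = resolver(hostname, 443, socket.AF_INET, socket.SOCK_STREAM)
    except OSError:
        # socket.gaierror and timeouts are both subclasses of OSError.
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for IPv4 the sockaddr is (address, port).
    return sorted({info[4][0] for info in infos})


def check_hosts(hostnames: Iterable[str],
                resolver: Callable = socket.getaddrinfo) -> Dict[str, List[str]]:
    """Map each hostname to its resolved IPv4 addresses (empty list = failing)."""
    return {name: resolve_a_records(name, resolver) for name in hostnames}
```

Calling `check_hosts` with the hostnames a site depends on (for example a CDN endpoint under azureedge.net) and alerting on empty lists would flag this failure mode, though it only reflects what the local resolver sees and says nothing about origin timeouts.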

Cause and technical discussion

Early guesses centered on DNS; initial status messages cited “DNS issues,” later updated to Azure Front Door issues and then an “inadvertent configuration change.”
The status history describes a bad Azure Front Door (AFD) configuration deployment, a bug in validation safeguards that let it pass, a global rollback to the “last known good” configuration, a block on further changes, and gradual node recovery.
Commenters emphasize configuration as the real single point of failure, and note the recurring pattern of “it’s DNS (or BGP).”
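
The recovery sequence in the status history (validation lets a bad config through, rollback to "last known good", further changes blocked) follows a common deployment pattern. Below is a minimal sketch of that pattern only; it is not Microsoft's mechanism, and `ConfigStore`, `validate`, and `health_check` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Config = Dict[str, object]


@dataclass
class ConfigStore:
    """Toy config deployer that keeps a last-known-good copy (illustrative only)."""
    validate: Callable[[Config], bool]      # pre-deployment safeguard
    health_check: Callable[[Config], bool]  # post-deployment signal
    active: Optional[Config] = None
    last_known_good: Optional[Config] = None
    frozen: bool = False
    history: List[str] = field(default_factory=list)

    def deploy(self, candidate: Config) -> bool:
        if self.frozen:
            self.history.append("rejected: changes frozen")
            return False
        if not self.validate(candidate):
            self.history.append("rejected: validation failed")
            return False
        self.active = candidate
        if self.health_check(candidate):
            self.last_known_good = candidate
            self.history.append("deployed")
            return True
        # Validation passed but the config is unhealthy in practice:
        # restore the previous good config and block further changes.
        self.active = self.last_known_good
        self.frozen = True
        self.history.append("rolled back to last known good; changes frozen")
        return False
```

The sketch makes the commenters' point concrete: when `validate` has a gap, the only remaining protection is the post-deployment health check, and in a global system the bad config is already live by the time it fires.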

Front Door reputation

Multiple teams report prior regional AFD incidents, often unacknowledged in Service Health.
Complaints include frequent regional outages, slow TLS handshakes, throughput caps, hard 500 ms origin timeout, and even Microsoft marketing content briefly appearing on customer sites.
Several organizations had already migrated off AFD (often to Cloudflare) and say this outage validates that choice; others now plan to move.

Status page and communication

Strong criticism that Azure’s public status page stayed green or minimized impact (initially “portal only”) and was updated slowly (a lag of 30+ minutes).
Some note the irony of status endpoints themselves being down or fronted by the same failing infra.
Others counter that status pages at hyperscalers often lag because of manual approval and SLA implications; a few contrast this with more transparent smaller providers.

Cloud reliability and strategy debates

Recent AWS and GCP incidents are frequently referenced; some see this as justification for multi-region or multi-cloud, others say multi-cloud is too complex except at large scale.
Anecdotes compare hyperscalers unfavorably to smaller VPS hosts and on‑prem setups, though others point out those lack managed services.
Broader concern that concentrating critical national services (ID, trains, payments) on a single cloud creates highly correlated, society‑wide failure modes.
