Tell HN: Azure outage

Scope and symptoms of the outage

  • Reported globally (Europe, APAC, US), with first customer impact around 15:45–16:00 UTC.
  • Azure Portal often unreachable or only partially loading; some users could access only a subset of their resources.
  • Azure Front Door and Azure CDN (azureedge.net) heavily impacted: slow or failing DNS resolution, intermittent or missing A records, origin timeouts (see the probe sketch after this list).
  • Many Microsoft-owned properties affected: microsoft.com, login.microsoftonline.com (Entra/SSO), VS Code site and updater, learn.microsoft.com, xbox.com, minecraft.net.
  • Downstream services broke: corporate SSO, Power Apps, Power Platform, GitHub large runners/Codespaces, Playwright browser downloads, winget, Outlook “modern” client, MS Clarity, various banks, airlines (e.g. check‑in), national digital ID systems, public transport planners, ticket machines, retail tills, and parking/payment systems.
  • Core compute largely kept working: many reported that VMs, databases, AKS, App Services not behind Front Door, and Azure DevOps itself remained functional.
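
  As an illustration of the DNS symptom above, a minimal stdlib probe might look like the following. The hostnames are examples pulled from the thread, the 5-second budget is an arbitrary choice, and a lookup that hangs can still delay interpreter exit, since getaddrinfo itself cannot be cancelled:

    import socket
    from concurrent.futures import ThreadPoolExecutor
    from concurrent.futures import TimeoutError as FuturesTimeout

    HOSTS = [
        "www.microsoft.com",
        "login.microsoftonline.com",
        "azureedge.net",  # Azure CDN apex cited in the thread
    ]

    def resolve(host: str) -> list[str]:
        # Collect the unique addresses behind host:443 over TCP.
        infos = socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP)
        return sorted({info[4][0] for info in infos})

    pool = ThreadPoolExecutor()
    for host in HOSTS:
        future = pool.submit(resolve, host)
        try:
            print(f"{host:30} -> {future.result(timeout=5)}")
        except FuturesTimeout:
            print(f"{host:30} -> TIMEOUT (slow or hanging resolution)")
        except socket.gaierror as exc:
            print(f"{host:30} -> resolution failed ({exc})")
    pool.shutdown(wait=False)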

Cause and technical discussion

  • Early guesses centered on DNS; initial status messages cited “DNS issues,” later revised to Azure Front Door issues and finally to an “inadvertent configuration change.”
  • The status history describes a bad AFD config deployment, a bug in the validation safeguards that let it through, a global rollback to the “last known good” config, a freeze on further changes, and gradual node recovery (sketched after this list).
  • Commenters emphasize configuration as the real single point of failure, and note the recurring pattern of “it’s DNS (or BGP).”
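
  The pattern the status history describes (validate, roll out, revert everything to last-known-good when a bad change slips through) can be sketched roughly as below. Every name here is hypothetical; nothing reflects Azure internals:

    from dataclasses import dataclass, field

    @dataclass
    class Node:
        name: str
        config: dict = field(default_factory=dict)

    def validate(config: dict) -> bool:
        # Pre-deployment safeguard. Per the status history, a bug at
        # this layer let the bad AFD config through.
        return bool(config.get("routes"))  # stand-in for real checks

    def healthy(node: Node) -> bool:
        return "routes" in node.config  # stand-in for real probes

    def deploy(nodes: list[Node], new: dict, lkg: dict) -> dict:
        # Roll out node by node; on the first unhealthy node, revert
        # the whole fleet to the last-known-good (LKG) snapshot and
        # freeze changes, mirroring the remediation described above.
        if not validate(new):
            raise ValueError("config rejected by safeguards")
        for node in nodes:
            node.config = new
            if not healthy(node):
                for n in nodes:
                    n.config = lkg  # global rollback
                raise RuntimeError("reverted to LKG; deployments frozen")
        return new  # success: this becomes the next LKG

    fleet = [Node("edge-eu"), Node("edge-us")]
    lkg = deploy(fleet, {"routes": ["app -> origin"]}, lkg={})

  The lesson commenters drew holds in the sketch too: if validate() itself is buggy, nothing downstream stops the bad config, which is why they call the config pipeline the real single point of failure.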

Front Door reputation

  • Multiple teams report prior regional AFD incidents, often unacknowledged in Service Health.
  • Complaints include frequent regional outages, slow TLS handshakes, throughput caps, hard 500 ms origin timeout, and even Microsoft marketing content briefly appearing on customer sites.
  • Several organizations had already migrated off AFD (often to Cloudflare) and say this outage validates that choice; others now plan to move.
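
  A rough sketch of the client-side fallback some in the thread applied during the outage: try the Front Door endpoint first, then hit the origin (or an alternate provider) directly. The hostnames are placeholders, not real endpoints:

    import urllib.request

    ENDPOINTS = [
        "https://app.example.azurefd.net/health",  # edge (placeholder)
        "https://origin.example.com/health",       # direct origin (placeholder)
    ]

    def fetch_with_fallback(urls, timeout=3.0):
        # Try each endpoint in order; the first HTTP 200 wins.
        last_error = None
        for url in urls:
            try:
                with urllib.request.urlopen(url, timeout=timeout) as resp:
                    if resp.status == 200:
                        return resp.read()
            except OSError as exc:  # URLError and socket timeouts subclass OSError
                last_error = exc    # edge unreachable or DNS failing: next hop
        raise RuntimeError(f"all endpoints failed: {last_error}")

  This only helps if the origin accepts direct traffic; origins locked down to accept connections exclusively from Front Door would need that restriction relaxed first.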

Status page and communication

  • Strong criticism that Azure’s public status page stayed green or minimized impact (initially framing it as “portal only”), and that updates lagged by 30+ minutes.
  • Some note the irony of status endpoints themselves being down or fronted by the same failing infra.
  • Others counter that hyperscaler status pages often lag because updates need manual approval and carry SLA implications; a few contrast this with more transparent smaller providers.

Cloud reliability and strategy debates

  • Recent AWS and GCP incidents are frequently referenced; some see this as justification for multi-region or multi-cloud, while others say multi-cloud is too complex for anyone but the largest operations.
  • Anecdotes compare hyperscalers unfavorably to smaller VPS hosts and on‑prem setups, though others point out those lack managed services.
  • Broader concern that concentrating critical national services (ID, trains, payments) on a single cloud creates highly correlated, society‑wide failure modes.