AWS North Virginia data center outage – resolved

Cooling failure and cascading effects

  • Outage attributed to a failed cooling loop in one Northern Virginia data center / AZ within us‑east‑1.
  • Commenters describe how cooling is typically overprovisioned (N+1, N+2, etc.) but still vulnerable when multiple components fail while load ramps up.
  • Detailed scenario: overlapping maintenance, latent faults, and a sudden batch workload can push cooling over the edge, triggering a cascade of failures and forced load shedding.
  • Some expect AWS to throttle workloads before thermal runaway; others note this is technically hard with modern hardware and can break the assumption that identical instances perform identically.
  • AWS reportedly shed load, including shutting down non‑preemptible workloads, which is why the event was visible to customers.
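
The N+1 scenario described above can be made concrete with some simple arithmetic. The following is a minimal sketch only: the unit capacity, unit counts, and heat loads are hypothetical illustration values, not figures from the incident, and real cooling plants do not degrade this linearly.

```python
from dataclasses import dataclass


@dataclass
class CoolingPlant:
    """Toy model of a cooling plant (all numbers are hypothetical)."""
    unit_capacity_kw: float   # heat one cooling unit can remove
    units_installed: int      # N units needed at design load, plus spares
    units_unavailable: int    # units in maintenance or failed

    def available_capacity_kw(self) -> float:
        working = max(self.units_installed - self.units_unavailable, 0)
        return working * self.unit_capacity_kw


def load_to_shed_kw(plant: CoolingPlant, heat_load_kw: float) -> float:
    """Heat load that must be shed so the remaining units can keep up."""
    deficit = heat_load_kw - plant.available_capacity_kw()
    return deficit if deficit > 0 else 0.0


# N+1: four units cover the 2000 kW design load, one is a spare.
plant = CoolingPlant(unit_capacity_kw=500.0, units_installed=5, units_unavailable=0)
print(load_to_shed_kw(plant, 2000.0))   # 0.0

# One unit in maintenance plus one latent fault, then a batch workload ramps up.
plant.units_unavailable = 2
print(load_to_shed_kw(plant, 2300.0))   # 800.0
```

With a single unit out, the spare absorbs the loss and nothing needs to be shed. The shortfall only appears when a second unit is missing at the same time as the load rises above the design point, which is the overlap the commenters describe.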

us‑east‑1 as structural weak point

  • Many comments reiterate that us‑east‑1 is the oldest, largest, and most complex region, with many internal “control plane” services and global services depending on it (IAM, Route 53, others).
  • Several note that an impairment in us‑east‑1 can affect workloads in other regions via these shared control planes, even when their local data planes are fine.
  • One participant challenges claims that IAM/STS are fully centralized, emphasizing AWS’s “static stability” model: control planes are centralized, while data planes are regional and are meant to keep working when the control plane is impaired.
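
The static stability idea mentioned above can be sketched as a regional data plane that keeps serving from its last known good configuration when the central control plane cannot be reached. This is an illustrative toy, not AWS's implementation: `RegionalDataPlane`, `ControlPlaneUnavailable`, and the fetch function are hypothetical names introduced here.

```python
from typing import Callable, Dict, Optional


class ControlPlaneUnavailable(Exception):
    """Raised by the (hypothetical) control-plane client when it is unreachable."""


class RegionalDataPlane:
    """Serves requests from cached config; never needs the control plane per request."""

    def __init__(self, fetch_config: Callable[[], Dict[str, str]]):
        self._fetch_config = fetch_config
        self._last_good: Optional[Dict[str, str]] = None

    def refresh(self) -> bool:
        """Try to pull fresh config; on failure keep the cached copy."""
        try:
            self._last_good = dict(self._fetch_config())
            return True
        except ControlPlaneUnavailable:
            return False

    def route(self, key: str) -> str:
        if self._last_good is None:
            raise RuntimeError("no config has ever been loaded")
        return self._last_good[key]


def make_flaky_fetch() -> Callable[[], Dict[str, str]]:
    """Hypothetical control plane that answers once, then becomes unreachable."""
    calls = {"n": 0}

    def fetch() -> Dict[str, str]:
        calls["n"] += 1
        if calls["n"] > 1:
            raise ControlPlaneUnavailable("central control plane unreachable")
        return {"api": "10.0.0.12"}

    return fetch


plane = RegionalDataPlane(make_flaky_fetch())
plane.refresh()            # succeeds and caches the config
plane.refresh()            # fails; the cached config is kept
print(plane.route("api"))  # 10.0.0.12
```

The trade-off in this sketch is that changes (new routes, new credentials) cannot be made while the control plane is down, which matches the thread's point that existing workloads can stay up even when management operations fail.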

Redundancy, AZs, and multi‑region / multi‑cloud

  • Some argue this incident only hit a single AZ and that proper multi‑AZ design would have mitigated it.
  • Others observe that many companies still run single‑AZ or single‑region for cost and latency reasons, including real‑time trading, crypto exchanges, and games.
  • Multi‑cloud is discussed as “true” high availability but widely seen as prohibitively expensive and complex for most organizations.
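
The argument for multi‑AZ design rests on simple probability, which can be sketched as follows. The 99.9% per‑zone figure is a hypothetical input, and the calculation assumes zone failures are independent. The thread itself points out that shared control planes can violate that assumption, so treat the results as an upper bound on the benefit.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(per_zone: float, zones: int) -> float:
    """Probability that at least one of `zones` independent replicas is up."""
    return 1.0 - (1.0 - per_zone) ** zones


def expected_downtime_hours(availability: float) -> float:
    """Expected hours per year during which no replica is up."""
    return (1.0 - availability) * HOURS_PER_YEAR


# Hypothetical per-zone availability of 99.9%, assuming independent failures.
for zones in (1, 2, 3):
    availability = combined_availability(0.999, zones)
    print(f"{zones} zone(s): {expected_downtime_hours(availability):.5f} h/year")
```

Each added zone multiplies the chance of a total outage by the per‑zone failure probability, so the gains are large on paper. The cost and latency objections raised in the discussion are the other side of that trade.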

Location and cooling strategy debates

  • Question: why not build data centers near oceans for simple water cooling?
  • Responses cite saltwater corrosion, fouling (jellyfish, mussels, debris), high coastal land costs, storm risks, and power‑access issues.
  • History and economics explain why Northern Virginia became a huge hub (early internet exchange, clustering effects, peering incentives).

Perceptions of AWS reliability

  • Some claim us‑east‑1 fails more often and is the “Achilles heel of the internet”; others say its outage frequency and impact are overstated.
  • There’s discussion of “safety in numbers” (when us‑east‑1 is down, everyone is down) versus the competitive advantage of staying up when others fail.