AWS North Virginia data center outage – resolved
Cooling failure and cascading effects
- Outage attributed to a failed cooling loop in a single Northern Virginia data center (one AZ) within us‑east‑1.
- Commenters describe how cooling is typically overprovisioned (N+1, N+2, etc.) but still vulnerable when multiple components fail while load ramps up.
- Detailed scenario: overlapping maintenance, latent faults, and a sudden batch workload can push cooling over the edge, triggering a cascade of failures and forced load shedding.
- Some expect AWS to throttle workloads before thermal runaway; others note this is technically hard with modern hardware and can break assumptions of homogeneous performance.
- AWS reportedly shed load, including shutting down non‑preemptible workloads, which is why the event was visible to customers.
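The N+1 argument from the comments can be made concrete with a little arithmetic. The sketch below is illustrative only: the unit count, the 200 kW unit capacity, and the load figures are hypothetical numbers chosen for the example, not values reported for this incident, and the function names are made up here.

```python
def cooling_headroom_kw(units_total, unit_capacity_kw, units_unavailable, heat_load_kw):
    """Return spare cooling capacity in kW; a negative value is a deficit."""
    working_units = max(units_total - units_unavailable, 0)
    return working_units * unit_capacity_kw - heat_load_kw


def load_to_shed_kw(units_total, unit_capacity_kw, units_unavailable, heat_load_kw):
    """Return how much heat load must be shed to get back within capacity."""
    headroom = cooling_headroom_kw(
        units_total, unit_capacity_kw, units_unavailable, heat_load_kw
    )
    return max(-headroom, 0)


# Hypothetical hall: 5 units of 200 kW cover a 1000 kW peak, plus 1 spare (N+1).
UNITS, UNIT_KW = 6, 200

scenarios = [
    ("normal day, partial load", 0, 700),
    ("one unit in maintenance, partial load", 1, 700),
    ("maintenance plus latent fault, partial load", 2, 700),
    ("maintenance plus latent fault, batch ramp to peak", 2, 1000),
]

for label, unavailable, load_kw in scenarios:
    shed = load_to_shed_kw(UNITS, UNIT_KW, unavailable, load_kw)
    print(f"{label}: shed {shed} kW")
```

Under these assumed numbers, the first three scenarios stay within capacity and only the last one, where two units are out while load ramps to peak, forces shedding. That matches the commenters' point that redundancy can hide a degraded state until demand rises.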
us‑east‑1 as structural weak point
- Many comments reiterate that us‑east‑1 is the oldest, largest, and most complex region, with many internal “control plane” services and global services depending on it (IAM, Route 53, others).
- Several note that an impairment in us‑east‑1 can affect workloads in other regions via these shared control planes, even when their local data planes are fine.
- One participant disputes claims that IAM and STS are fully centralized, pointing to AWS’s “static stability” model: control planes may be centralized, but data planes are regional and keep running on their last known state.
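As a rough illustration of the static stability idea, a client can keep serving the last value it fetched when the control plane stops answering. This is a minimal sketch under stated assumptions: `LastKnownGood` and the `fetch` callable are hypothetical names invented for the example, and nothing here reflects how AWS implements IAM or STS internally.

```python
import time
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class LastKnownGood(Generic[T]):
    """Serve a cached value when refreshing from the control plane fails."""

    def __init__(
        self,
        fetch: Callable[[], T],
        refresh_after_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch                  # hypothetical control-plane call
        self._refresh_after_s = refresh_after_s
        self._clock = clock
        self._has_value = False
        self._value = None
        self._fetched_at = 0.0

    def get(self) -> T:
        now = self._clock()
        due = not self._has_value or now - self._fetched_at >= self._refresh_after_s
        if due:
            try:
                self._value = self._fetch()
                self._fetched_at = now
                self._has_value = True
            except Exception:
                # Control plane unreachable: fall back to the last good value.
                # With nothing cached yet there is no safe fallback, so re-raise.
                if not self._has_value:
                    raise
        return self._value
```

The design choice is that a refresh failure never invalidates state that already works. The cost is possibly stale data, which is the trade-off the static stability argument accepts.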
Redundancy, AZs, and multi‑region / multi‑cloud
- Some argue this incident only hit a single AZ and that proper multi‑AZ design would have mitigated it.
- Others observe that many companies still run single‑AZ or single‑region for cost and latency reasons, including real‑time trading, crypto exchanges, and games.
- Multi‑cloud is discussed as “true” high availability but widely seen as prohibitively expensive and complex for most organizations.
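The multi‑AZ argument rests on simple probability. The sketch below assumes AZ failures are independent, which the shared control plane concerns raised in the thread call into question, and the 99.5% per‑AZ availability is a hypothetical figure, not an AWS number.

```python
HOURS_PER_YEAR = 8760


def combined_availability(single_az_availability: float, az_count: int) -> float:
    """Availability of a service that stays up while at least one AZ is up.

    Assumes AZ failures are independent, which is optimistic.
    """
    return 1 - (1 - single_az_availability) ** az_count


def expected_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year at a given availability."""
    return (1 - availability) * HOURS_PER_YEAR


# Hypothetical per-AZ availability, used for illustration only.
PER_AZ = 0.995

for az_count in (1, 2, 3):
    availability = combined_availability(PER_AZ, az_count)
    downtime = expected_downtime_hours(availability)
    print(f"{az_count} AZ(s): {availability:.6f} available, about {downtime:.2f} h/year down")
```

Any correlated failure, such as a regional control plane impairment, breaks the independence assumption and makes the real figure worse than this estimate.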
Location and cooling strategy debates
- Question: why not build data centers near oceans for simple water cooling?
- Responses cite saltwater corrosion, fouling (jellyfish, mussels, debris), high coastal land costs, storm risks, and power‑access issues.
- History and economics explain why Northern Virginia became a huge hub (early internet exchange, clustering effects, peering incentives).
Perceptions of AWS reliability
- Some claim us‑east‑1 fails more often and is the “Achilles heel of the internet”; others say its outage frequency and impact are overstated.
- There’s discussion of “safety in numbers” (when us‑east‑1 is down, everyone is down) versus the competitive advantage of staying up when others fail.