2025-10-20

AWS outage shows internet users 'at mercy' of too few providers, experts say

Scale and Centralization of AWS

Commenters highlight how much traffic runs through AWS (and CloudFront/Cloudflare), arguing this concentrates systemic risk in a few “sheds in Virginia.”
Some see this as basic economics: low distribution costs produce power-law winners (AWS/Azure/GCP).
Others note that many non-cloud options still exist (colo, bare metal, VPS), and that centralization is driven as much by lock-in and marketing as by technical merit.

Nature of the Outage (us-east-1)

Many stress it was not a total regional blackout: existing EC2/Fargate workloads mostly kept running; control planes and some “global” services failed.
IAM, STS, Lambda, SQS, DynamoDB, EC2 launches, and CloudWatch visibility were common pain points.
Several teams discovered hidden dependencies on us-east-1 endpoints (e.g., IAM), even for workloads in other regions.
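
The hidden-dependency point can be illustrated with a small audit script. This is a minimal sketch, not something taken from the thread: it assumes you can collect the endpoint URLs your application calls, and it assumes AWS hostnames follow the common `<service>.<region>.amazonaws.com` pattern. The function name `flag_out_of_region_endpoints` and the sample URLs are hypothetical. A hostname with no region label, such as `sts.amazonaws.com`, is flagged because it may be served from a different region than the workload.

```python
from urllib.parse import urlparse


def flag_out_of_region_endpoints(endpoints, home_region):
    """Return AWS endpoint URLs whose hostname does not name the home region.

    Hypothetical helper: assumes hostnames shaped like
    <service>.<region>.amazonaws.com, which does not cover every AWS service.
    """
    flagged = []
    for url in endpoints:
        host = urlparse(url).hostname or ""
        labels = host.split(".")
        # Non-AWS hosts are ignored; AWS hosts must carry the home region label.
        if host.endswith(".amazonaws.com") and home_region not in labels:
            flagged.append(url)
    return flagged


# Sample configuration for a workload that is supposed to live in eu-west-1.
configured = [
    "https://sts.amazonaws.com",                # no region label at all
    "https://sts.eu-west-1.amazonaws.com",      # regional endpoint
    "https://dynamodb.us-east-1.amazonaws.com", # explicit us-east-1 dependency
    "https://example.com/api",                  # not an AWS hostname
]
for url in flag_out_of_region_endpoints(configured, "eu-west-1"):
    print("check this dependency:", url)
```

A check like this only covers endpoints that are explicitly configured; dependencies inside a provider's own control plane would not show up in it.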

Lock-In, Data Gravity, and Cost

Large datasets (terabytes to hundreds of terabytes in S3) are cited as the main practical lock-in, not compute.
Cross-region or multi-cloud replication is considered prohibitively expensive for many, mainly because of duplicated storage and egress fees.
Some mention that competitors or AWS will sometimes eat egress fees for migrations, but ongoing duplication cost and complexity remain.
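
The ongoing duplication cost mentioned above can be estimated with simple arithmetic. The sketch below is illustrative only: the function name `monthly_duplication_cost` is hypothetical, and the prices passed in are placeholders, not real rates from AWS or any other provider. It assumes the second copy is billed for full storage each month and that only the changed fraction of the data is transferred again.

```python
def monthly_duplication_cost(dataset_tb, storage_usd_per_gb_month,
                             transfer_usd_per_gb, monthly_change_fraction):
    """Estimate the recurring monthly cost of keeping a second copy of a dataset."""
    dataset_gb = dataset_tb * 1000  # treating 1 TB as 1000 GB for simplicity
    # The replica is stored in full, so storage is paid on the whole dataset.
    storage_cost = dataset_gb * storage_usd_per_gb_month
    # Only data that changed during the month has to be copied again.
    transfer_cost = dataset_gb * monthly_change_fraction * transfer_usd_per_gb
    return {
        "storage": storage_cost,
        "transfer": transfer_cost,
        "total": storage_cost + transfer_cost,
    }


# Placeholder inputs: 100 TB, made-up per-GB prices, 10% of data changing monthly.
estimate = monthly_duplication_cost(100, 0.02, 0.05, 0.10)
print(estimate)
```

Substituting real price-sheet numbers and a measured change rate shows whether the recurring total, rather than the one-time migration transfer, is the part that dominates.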

Multi-Cloud / Multi-Region Resilience

Broad agreement that true multi-cloud resilience is rare, owing to cognitive overhead, provider differences, orchestration pain, and data consistency issues.
Cross-region designs are also hard: stateful systems, eventual consistency, and replay/merge of writes after failover.
Many companies consciously accept rare regional outages as a business tradeoff; others argue they misjudge risk and never properly test failover.
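
To make the "replay/merge of writes after failover" problem concrete, here is a minimal sketch of one possible merge policy, last-writer-wins. It is not a method proposed in the thread; the `Write` type and `merge_last_writer_wins` function are hypothetical, and the sketch assumes timestamps are comparable across regions, which real clocks do not guarantee.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    key: str
    value: str
    timestamp: float  # assumed comparable across regions (a strong assumption)
    region: str


def merge_last_writer_wins(primary_log, secondary_log):
    """Merge two write logs, keeping the newest write for each key."""
    merged = {}
    for write in list(primary_log) + list(secondary_log):
        current = merged.get(write.key)
        # Ties on timestamp are broken by region name so the result is deterministic.
        if current is None or (write.timestamp, write.region) > (
            current.timestamp, current.region
        ):
            merged[write.key] = write
    return merged


# Writes accepted by each region while they could not see each other.
primary = [
    Write("cart", "a", 1.0, "us-east-1"),
    Write("user", "x", 5.0, "us-east-1"),
]
secondary = [Write("cart", "b", 2.0, "us-west-2")]

for key, write in sorted(merge_last_writer_wins(primary, secondary).items()):
    print(key, write.value, write.region)
```

The policy is simple because it silently discards the older of two conflicting writes. Whether that loss is acceptable is the kind of business tradeoff the commenters describe, and it is only discovered in advance if failover is actually tested.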

Containers and Cloud Lock-In

One view: Docker normalized “just ship a container and let the cloud handle storage/infra,” encouraging deeper reliance on proprietary services.
Counterview: containers are orthogonal to storage, reduce host-management toil, and actually make it easier to move between clouds or on-prem.

Alternatives and On-Prem

Some advocate VPS/local providers or colo to reduce correlated failures and costs, but acknowledge higher operational burden.
Others share that on-prem/colo setups often had more and longer outages due to limited in-house expertise and slower incident response.

Policy, “Experts,” and Systemic Risk

Several criticize media “experts” as non-technical policy or legal figures; others defend their role in assessing geopolitical/systemic dependence on foreign hyperscalers.
A recurring theme: AWS is likely still more reliable than most alternatives; the real issue is how customers architect and test their systems.
