GCP Outage
Scope of the outage
- Users reported widespread failures across many Google Cloud services: Console, GCS, Cloud Run, Cloud SQL, BigQuery, IAM, GKE, Cloud Build, Dataproc, Cloud Data Fusion, Firebase Auth/Firestore/Hosting/Data Connect, Cloud Shell, Cloud Workstations, Vertex AI Search, Gemini API, reCAPTCHA, and Google Meet/Chat/Maps/Street View/Nest/RCS messaging.
- Impact spanned many regions (us‑west1, us‑central1, us‑east1, Europe including Frankfurt/Netherlands, Asia including South Korea and India).
- Many third‑party platforms broke as collateral damage: Anthropic/Claude, Supabase, Sentry, npm/Yarn, Docker Hub (partially), Expo/FCM‑based systems, Discord uploads, Twitch, Spotify, Mapbox, xAI, various AI dev tools, and more.
- Some workloads (e.g., App Engine, intra‑VPC traffic) kept working, indicating control‑plane/auth failures rather than pure compute loss.
Status pages and transparency
- For a long initial window, GCP’s public status page showed “No major incidents” and all‑green checks while users saw pervasive errors.
- Firebase’s status site explicitly cited a “Google Cloud global outage” before the GCP page acknowledged anything; that wording was later edited to remove “Google Cloud.”
- Multiple comments assert that major cloud status pages are updated manually and gated behind PR/Legal/VP‑level approval because of SLA and compensation implications.
- Many participants say they don’t trust first‑party status pages at all, preferring Downdetector, social media, or community chatter, despite acknowledging crowdsourced noise.
Root cause theories and dependencies
- The official incident report later attributed the GCP issues to a problem in the Identity and Access Management service; errors like “visibility check was unavailable” and token refresh failures matched this.
- One contributor tied the symptoms to an internal Google service (“Chemist”) that enforces project/billing/abuse/quotas; this is speculative but fits the pattern of global auth/policy failure.
- Cloudflare reported its Workers KV going offline due to a “3rd party service that is a key dependency,” strongly suspected in the thread to be a GCP service.
- Early speculation about BGP or backbone issues was later judged unlikely by several participants; the consensus leans toward a higher‑level auth/control‑plane failure, though anything beyond the published incident report remains speculation.
Reliability, architecture, and cloud risk
- Discussion highlights how a single global control‑plane/SaaS dependency (IAM, auth, KV, config services) can bypass region/zone redundancy: services are “up” but cannot authorize, so they effectively fail everywhere.
- People note that large systems are always partially degraded; the debate is where to set the threshold for calling an incident and how to represent partial vs widespread impact.
- Some argue status pages should at least say “degraded, some users seeing errors”; others emphasize the non‑binary nature of “up/down” and the difficulty of automating meaningful, low‑false‑positive health signals at this scale.
- Several engineers discuss circuit breakers and backoff as key patterns: external SaaS outages can otherwise cascade into self‑inflicted failures even on unaffected clouds (see the sketch after this list).
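A minimal sketch of those two patterns, assuming nothing from the thread beyond the general idea: `CircuitBreaker`, `call_with_backoff`, and `verify_token` are hypothetical names, and `verify_token` stands in for the kind of shared auth/IAM/KV dependency that every region funnels through. The breaker fails fast once the dependency is clearly down, and the jittered exponential backoff keeps retries from turning a vendor outage into a self‑inflicted retry storm.

```python
import random
import time


class CircuitBreaker:
    """Trips after consecutive failures and fails fast while open,
    instead of piling more requests onto a dependency that is down."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        # While open, reject immediately until the reset timeout elapses,
        # then allow a single trial call through ("half-open").
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None

        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        else:
            self.failures = 0  # a success closes the circuit again
            return result


def call_with_backoff(breaker, fn, max_attempts=4, base_delay=0.5):
    """Retry through the breaker with exponential backoff plus jitter,
    so clients don't retry in lockstep during a shared outage."""
    for attempt in range(max_attempts):
        try:
            return breaker.call(fn)
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))


def verify_token():
    # Hypothetical stand-in for a call to an external auth/IAM/KV service.
    raise ConnectionError("upstream auth service unavailable")


breaker = CircuitBreaker()
try:
    call_with_backoff(breaker, verify_token)
except Exception as exc:
    # Degrade gracefully (cached credentials, read-only mode, etc.)
    # rather than letting every request hang on the dead dependency.
    print(f"dependency down, serving degraded response: {exc}")
```

The design choice being argued for in the thread is the combination: backoff alone still lets every instance hammer a dead dependency, while a breaker plus some degraded fallback (cached tokens, read-only mode) keeps the rest of the system responsive during an outage it cannot fix.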
Centralization, SLAs, and business incentives
- There is broad cynicism that uptime metrics and “five nines” claims are massaged via optimistic definitions and underreporting; SLAs are seen as mostly contractual escape hatches rather than guarantees of reliability.
- Some argue big customers likely get accurate private incident info even when public dashboards lag or downplay issues.
- The outage is cited as a lesson in “counterparty risk” and over‑reliance on a few giant cloud/edge providers; a few commenters say their mostly self‑hosted stacks rode out the event unnoticed.
Cultural and usage observations
- The outage made many developers confront how dependent they’ve become on cloud‑hosted AI tools (Gemini, Claude, etc.) for everyday coding and ticket triage.
- The RCS outage, contrasted with relatively robust SMS history, is used as an example of how centralization (e.g., Google’s hosted Jibe backend) can create new single points of failure.
- Several note the irony that even monitoring, auth, status, and CDN/control‑plane systems (Cloudflare, Firebase, IAM, KV stores) themselves depended on the same clouds they’re supposed to make resilient.