Google Cloud Incident Report – 2025-06-13

Root Cause and Nature of the Bug

  • Many readers see the outage as rooted in a simple, “junior-level” null pointer / blank-field handling bug in a critical quota path (the basic failure shape is sketched after this list).
  • Others argue the null pointer is incidental; the real issue is that a new global quota policy and its schema change were insufficiently validated.
  • Several commenters emphasize that such “this can’t happen” assumptions are common in large systems, even with experienced engineers.
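
  To make the debate concrete, here is a minimal Rust sketch of the failure shape commenters describe (hypothetical types and names, not the actual service code): a replicated policy record gains a new optional field, and the quota path assumes it is always populated.

      // Hypothetical policy record as replicated to quota-checking servers.
      struct QuotaPolicy {
          project: String,
          // Field added by the schema change; a blank row leaves it unset.
          limit: Option<u64>,
      }

      fn enforce(policy: &QuotaPolicy, usage: u64) -> bool {
          // The "this can't happen" assumption: the new field is always populated.
          let limit = policy.limit.expect("limit must be set");
          usage <= limit
      }

      fn main() {
          // A single blank row turns every request that loads it into a crash.
          let blank = QuotaPolicy { project: "demo".into(), limit: None };
          println!("{}: {}", blank.project, enforce(&blank, 10)); // panics here
      }

  Whether shipping that assumption is a “junior-level” mistake or an unavoidable gap in validation is exactly what the thread argues about.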

Testing, Rollout, and Config vs Binary

  • There’s broad criticism that standard defenses all failed: no effective test for the bad input, no feature flag gating, no gradual rollout of the policy that activated the new code path.
  • Multiple posters highlight that while binaries/configs typically use staged rollouts and canarying, this change came via database‑backed policy replicated globally within seconds, bypassing those safeguards (the kind of staged gate sketched below).
  • Some see this as “another CrowdStrike”: a global config mechanism with no blast-radius limiting.
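
  As an illustration of the safeguard commenters say was bypassed, the std-only Rust sketch below (hypothetical names and percentages) shows a hash-based staged-rollout gate of the kind binaries and configs normally pass through; the point in the thread is that the database-backed policy jumped straight to 100% of projects everywhere within seconds.

      use std::collections::hash_map::DefaultHasher;
      use std::hash::{Hash, Hasher};

      // Hypothetical blast-radius gate: activate a new policy/code path for only
      // `percent` of projects, keyed on a stable hash, before going global.
      fn in_rollout(project_id: &str, percent: u64) -> bool {
          let mut h = DefaultHasher::new();
          project_id.hash(&mut h);
          (h.finish() % 100) < percent
      }

      fn main() {
          for (stage, pct) in [("canary", 1u64), ("early", 10), ("global", 100)] {
              let n = (0..10_000)
                  .filter(|i| in_rollout(&format!("proj-{i}"), pct))
                  .count();
              println!("{stage}: {pct}% target -> {n}/10000 projects on new path");
          }
      }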

Feature Flags, “Red Button,” and CI/CD

  • The postmortem’s promise to feature‑flag all critical binary changes is viewed by some as an over‑correction that could hurt productivity; others note far stricter norms in aerospace.
  • Confusion and skepticism surround the “red button”: was it truly pre‑wired, or was it itself a change that had to be prepared and deployed?
  • Several mention feature‑flag systems that enforce gradual rollouts and cleanup, but also note the combinatorial complexity such flags introduce (a minimal flag-and-kill-switch gate is sketched below).
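
  A minimal sketch of what "feature-flag the new path, ship it default-off, and pre-wire a kill switch" could look like (hypothetical flag names in Rust; the systems commenters describe refresh these from a config service and are far more elaborate):

      use std::sync::atomic::{AtomicBool, Ordering};

      // Hypothetical flags; a real system would refresh these from a config service.
      static NEW_QUOTA_PATH: AtomicBool = AtomicBool::new(false); // ships default-off
      static RED_BUTTON: AtomicBool = AtomicBool::new(false);     // pre-wired kill switch

      fn new_quota_logic(usage: u64, limit: u64) -> bool {
          usage < limit // stand-in for the freshly shipped behavior
      }

      fn check_quota(usage: u64, limit: u64) -> bool {
          if NEW_QUOTA_PATH.load(Ordering::Relaxed) && !RED_BUTTON.load(Ordering::Relaxed) {
              new_quota_logic(usage, limit)
          } else {
              usage <= limit // old, well-exercised path
          }
      }

      fn main() {
          println!("flag off: {}", check_quota(5, 5));
          NEW_QUOTA_PATH.store(true, Ordering::Relaxed);
          println!("flag on: {}", check_quota(5, 5));
          RED_BUTTON.store(true, Ordering::Relaxed);
          println!("red button: {}", check_quota(5, 5)); // back to the old path, no redeploy
      }

  The skeptics' question remains whether such a button is genuinely pre-wired and regularly exercised, or ends up being one more change prepared under pressure; and each flag like this adds to the combinatorial state the thread worries about.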

Null Pointers, Languages, and Type Systems

  • Large subthread debates whether languages with non‑null/Option types (Rust, Haskell, Kotlin, etc.) would have prevented this.
  • One side argues that stricter type systems force explicit handling and make these bugs rarer; the other counters that you can still crash via .unwrap()/expect() or equivalent bad assumptions (both sides are illustrated below).
  • Go and C++ error‑handling and nil semantics are criticized; others counter that no language fully eliminates logical mistakes.
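
  A small Rust illustration of both sides of that argument (illustrative only, not tied to any particular service): an Option type forces callers to decide what absence means, while the escape hatches reintroduce exactly the same crash.

      fn main() {
          let limit: Option<u64> = None; // a "blank" field

          // The type system forces an explicit decision about absence...
          let effective = match limit {
              Some(n) => n,
              None => u64::MAX, // e.g. treat a missing limit as "unlimited"
          };
          println!("explicit handling: {effective}");

          // ...or an explicit opt-in to a default:
          println!("with default: {}", limit.unwrap_or(u64::MAX));

          // ...but the escape hatch restores the original failure mode:
          println!("unwrap: {}", limit.unwrap()); // panics at runtime
      }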

Throttling, Backoff, and Recovery Dynamics

  • Lack of exponential backoff and load‑shedding is widely criticized; restart storms overloaded Spanner and prolonged recovery, especially in us‑central1.
  • Some note that startup paths often lack backoff logic even when request paths have it (see the sketch below), and that quotas or limits that are fine in steady‑state can fail badly during mass restarts.
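
  A std-only Rust sketch of the capped, jittered exponential backoff commenters wanted on the restart path (hypothetical constants; real systems would pair this with server-side load shedding):

      use std::time::{Duration, SystemTime, UNIX_EPOCH};

      // Capped exponential backoff with jitter for a startup dependency
      // (e.g. loading policies from a backing store). Without it, thousands of
      // crash-looping tasks retry in lockstep and become a thundering herd.
      fn backoff(attempt: u32) -> Duration {
          let base_ms = 100u64;
          let cap_ms = 30_000u64;
          let exp = base_ms.saturating_mul(1u64 << attempt.min(8)); // 100ms, 200ms, 400ms, ...
          let capped = exp.min(cap_ms);
          // Cheap std-only jitter: spread retries across [capped/2, capped).
          let nanos = SystemTime::now()
              .duration_since(UNIX_EPOCH)
              .unwrap()
              .subsec_nanos() as u64;
          Duration::from_millis(capped / 2 + nanos % (capped / 2))
      }

      fn main() {
          for attempt in 0..6 {
              // In a real startup path: try the dependency, and on failure
              // sleep for backoff(attempt) before retrying.
              println!("attempt {attempt}: wait ~{:?}", backoff(attempt));
          }
      }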

Culture, Leadership, and Reliability Expectations

  • Several self‑identified insiders describe long‑term leadership pressure for velocity, offshoring, and “flashy” projects over maintenance, eroding quality standards.
  • Others argue that at Google scale, rare global incidents are inevitable and not proof of systemic incompetence, though this one “looks like a small‑company mistake.”
  • Commenters are split between dismissing FAANG reliability as overhyped “myth/PR” and seeing it as still significantly better than average despite visible failures.

Fail‑Open vs Fail‑Closed and Security

  • The plan to “fail open” for quota checks worries some from a security perspective; others assume this is limited to quota, not authz.
  • Several note that fail‑open vs fail‑closed is a deep policy tradeoff that’s easy to get wrong under outage pressure (both defaults are compared in the sketch below).
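
  A minimal illustration of that tradeoff (hypothetical quota check in Rust): the entire policy decision is what to return when the check itself cannot be performed.

      enum FailureMode { Open, Closed }

      // Hypothetical quota check whose backing policy store may be unreachable.
      fn quota_allows(lookup: Result<u64, &str>, usage: u64, mode: FailureMode) -> bool {
          match lookup {
              Ok(limit) => usage <= limit,
              // The whole fail-open vs fail-closed debate lives in this arm.
              Err(_) => match mode {
                  FailureMode::Open => true,    // favor availability: let traffic through
                  FailureMode::Closed => false, // favor strictness: reject until healthy
              },
          }
      }

      fn main() {
          let outage: Result<u64, &str> = Err("policy store unreachable");
          println!("fail open:   {}", quota_allows(outage, 42, FailureMode::Open));
          println!("fail closed: {}", quota_allows(outage, 42, FailureMode::Closed));
          // Commenters' caveat: defensible for a quota check, but the same default
          // on an authorization check would be a security hole.
      }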