Sharing details on a recent incident impacting one of our customers
Incident nature and scope
- Many were surprised the failure was entirely on Google Cloud’s side, not customer misconfiguration.
- Clarification that a GCVE (managed VMware) private cloud in one region was deleted, not the entire GCP account, reassured some but still seen as catastrophic.
- Some doubt Google’s claim that this was the first such incident, arguing it’s unlikely a multi‑billion fund was the first affected.
Root cause, defaults, and internal tools
- Strong criticism that an internal tool allowed a blank parameter to default to “auto‑delete after a year,” and that this could bypass usual safety checks.
- Several see this as a systemic process failure, not an isolated bug: internal tools are perceived as under‑scrutinized “tech debt” compared with public APIs.
- Others note Google’s fix (removing the manual operation and further automating deployment) addresses this specific path but not necessarily broader classes of errors.
Safeguards, deletion semantics, and proposed mitigations
- Many are alarmed that there was no soft delete, advance notification, or human approval before deleting an active, large tenant.
- Suggested mitigations:
- Soft‑delete/“disabled but recoverable” state before permanent deletion.
- Mandatory pre‑deletion customer notifications, regardless of trigger.
- Human review for large or in‑use service terminations.
- Cross‑cloud or external backup options as first‑class features.
- Debate over soft delete: some see it as “enterprise 101”; others point to complexity and data‑protection concerns, but most agree high‑level soft delete for entire services is reasonable.
Backups and cross‑cloud resilience
- The customer’s off‑platform or third‑party backups are viewed as the real savior; this prompts strong advocacy for off‑cloud or cross‑cloud backups.
- Some note regulators force such redundancy in financial sectors; many less‑regulated users likely lack this protection.
Trust in GCP vs AWS/Azure
- Repeated claims that AWS is still the reliability “gold standard,” Azure has security issues, and GCP lags in maturity and rigor.
- Some argue all large providers have serious “war stories”; what matters is quality of postmortems and systemic fixes.
Google’s postmortem, culture, and communication
- Many see the postmortem as defensive, focused on denying systemic issues rather than deep remediation.
- Critiques include:
- Lack of broader audits of deletion workflows beyond GCVE.
- Overly self‑congratulatory tone (“most resilient cloud”) despite the failure.
- Nameless, corporate voice instead of accountable leadership sign‑off.
- Speculation that the customer received substantial credits or compensation; details remain unclear.