2024-05-24

Sharing details on a recent incident impacting one of our customers

Incident nature and scope

Many were surprised that the failure was entirely on Google Cloud’s side rather than a customer misconfiguration.
The clarification that a Google Cloud VMware Engine (GCVE, managed VMware) private cloud in one region was deleted, not the entire GCP account, reassured some, but the loss was still seen as catastrophic.
Some doubt Google’s claim that this was the first such incident, arguing it is unlikely that a multi‑billion‑dollar fund was the first customer affected.

Root cause, defaults, and internal tools

Strong criticism that an internal tool let a blank parameter silently default to “auto‑delete after a year,” and that this path could bypass the usual safety checks.
Several see this as a systemic process failure, not an isolated bug: internal tools are perceived as under‑scrutinized “tech debt” compared with public APIs.
Others note Google’s fix (removing the manual operation and further automating deployment) addresses this specific path but not necessarily broader classes of errors.
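
The blank-parameter complaint can be illustrated with a small sketch. It is not Google's tooling: the names `parse_term` and `provision`, the `Provisioning` record, and the "permanent" keyword are all hypothetical. It only shows the general idea commenters argue for, namely that a missing value should be rejected rather than quietly mapped to a deletion date.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class Provisioning:
    """Hypothetical record produced by an internal provisioning tool."""
    tenant: str
    delete_on: Optional[date]  # None means "never auto-delete"


def parse_term(raw: Optional[str]) -> Optional[int]:
    """Return the term in days, or None for an explicitly permanent deployment.

    A blank or missing value is an error: there is no silent default.
    """
    if raw is None or raw.strip() == "":
        raise ValueError("term is required: pass a number of days or 'permanent'")
    value = raw.strip().lower()
    if value == "permanent":
        return None
    days = int(value)  # raises ValueError on non-numeric input
    if days <= 0:
        raise ValueError("term must be a positive number of days")
    return days


def provision(tenant: str, raw_term: Optional[str], today: date) -> Provisioning:
    """Build the provisioning record; deletion is scheduled only on explicit request."""
    days = parse_term(raw_term)
    delete_on = None if days is None else today + timedelta(days=days)
    return Provisioning(tenant=tenant, delete_on=delete_on)
```

With this shape, an operator who leaves the field empty gets an immediate error instead of a deployment that deletes itself a year later.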

Safeguards, deletion semantics, and proposed mitigations

Many are alarmed that there was no soft delete, advance notification, or human approval before deleting an active, large tenant.
Suggested mitigations:
- Soft‑delete/“disabled but recoverable” state before permanent deletion.
- Mandatory pre‑deletion customer notifications, regardless of trigger.
- Human review for large or in‑use service terminations.
- Cross‑cloud or external backup options as first‑class features.
Debate over soft delete: some see it as “enterprise 101”; others point to complexity and data‑protection concerns, but most agree high‑level soft delete for entire services is reasonable.
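
A minimal sketch of how the first three suggestions could fit together, assuming a hypothetical `Tenant` record, a 30-day retention window, and a 100-VM threshold for human review. None of these values come from the incident report; they are placeholders for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from enum import Enum
from typing import List, Optional


class State(Enum):
    ACTIVE = "active"
    DISABLED = "disabled"  # soft-deleted: still recoverable
    PURGED = "purged"      # permanently deleted


RETENTION = timedelta(days=30)  # assumed grace period before permanent deletion
LARGE_TENANT_VMS = 100          # assumed size above which a human must approve


@dataclass
class Tenant:
    name: str
    vm_count: int
    state: State = State.ACTIVE
    disabled_at: Optional[datetime] = None
    notices: List[str] = field(default_factory=list)  # stand-in for customer notifications


def request_delete(t: Tenant, now: datetime, approved_by: Optional[str] = None) -> None:
    """Soft-delete a tenant: notify the customer and start the retention clock."""
    if t.state is not State.ACTIVE:
        raise ValueError("only an active tenant can be disabled")
    if t.vm_count >= LARGE_TENANT_VMS and not approved_by:
        raise PermissionError("large tenant: human approval required")
    deadline = (now + RETENTION).date()
    t.notices.append(f"{t.name} disabled; recoverable until {deadline}")
    t.state = State.DISABLED
    t.disabled_at = now


def restore(t: Tenant) -> None:
    """Undo a soft delete while the tenant is still within retention."""
    if t.state is not State.DISABLED:
        raise ValueError("only a disabled tenant can be restored")
    t.state = State.ACTIVE
    t.disabled_at = None


def purge_if_expired(t: Tenant, now: datetime) -> bool:
    """Permanently delete only after the retention window has fully elapsed."""
    if t.state is State.DISABLED and now - t.disabled_at >= RETENTION:
        t.state = State.PURGED
        return True
    return False
```

The notification is recorded regardless of what triggered the deletion, and the only route to `PURGED` passes through `DISABLED`, so an accidental trigger leaves a window in which the tenant can be restored.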

Backups and cross‑cloud resilience

The customer’s off‑platform or third‑party backups are viewed as the real savior; this prompts strong advocacy for off‑cloud or cross‑cloud backups.
Some note that regulators require such redundancy in the financial sector; many less‑regulated users likely lack this protection.

Trust in GCP vs AWS/Azure

Repeated claims that AWS is still the reliability “gold standard,” Azure has security issues, and GCP lags in maturity and rigor.
Some argue all large providers have serious “war stories”; what matters is quality of postmortems and systemic fixes.

Google’s postmortem, culture, and communication

Many see the postmortem as defensive, more focused on denying systemic issues than on deep remediation.
Critiques include:
- Lack of broader audits of deletion workflows beyond GCVE.
- Overly self‑congratulatory tone (“most resilient cloud”) despite the failure.
- Nameless, corporate voice instead of accountable leadership sign‑off.
Speculation that the customer received substantial credits or compensation; details remain unclear.
