Sharing details on a recent incident impacting one of our customers

Incident nature and scope

  • Many were surprised the failure was entirely on Google Cloud’s side, not customer misconfiguration.
  • Clarification that a GCVE (managed VMware) private cloud in one region was deleted, not the entire GCP account, reassured some but still seen as catastrophic.
  • Some doubt Google’s claim that this was the first such incident, arguing it’s unlikely a multi‑billion fund was the first affected.

Root cause, defaults, and internal tools

  • Strong criticism that an internal tool allowed a blank parameter to default to “auto‑delete after a year,” and that this could bypass usual safety checks.
  • Several see this as a systemic process failure, not an isolated bug: internal tools are perceived as under‑scrutinized “tech debt” compared with public APIs.
  • Others note Google’s fix (removing the manual operation and further automating deployment) addresses this specific path but not necessarily broader classes of errors.

Safeguards, deletion semantics, and proposed mitigations

  • Many are alarmed that there was no soft delete, advance notification, or human approval before deleting an active, large tenant.
  • Suggested mitigations:
    • Soft‑delete/“disabled but recoverable” state before permanent deletion.
    • Mandatory pre‑deletion customer notifications, regardless of trigger.
    • Human review for large or in‑use service terminations.
    • Cross‑cloud or external backup options as first‑class features.
  • Debate over soft delete: some see it as “enterprise 101”; others point to complexity and data‑protection concerns, but most agree high‑level soft delete for entire services is reasonable.

Backups and cross‑cloud resilience

  • The customer’s off‑platform or third‑party backups are viewed as the real savior; this prompts strong advocacy for off‑cloud or cross‑cloud backups.
  • Some note regulators force such redundancy in financial sectors; many less‑regulated users likely lack this protection.

Trust in GCP vs AWS/Azure

  • Repeated claims that AWS is still the reliability “gold standard,” Azure has security issues, and GCP lags in maturity and rigor.
  • Some argue all large providers have serious “war stories”; what matters is quality of postmortems and systemic fixes.

Google’s postmortem, culture, and communication

  • Many see the postmortem as defensive, focused on denying systemic issues rather than deep remediation.
  • Critiques include:
    • Lack of broader audits of deletion workflows beyond GCVE.
    • Overly self‑congratulatory tone (“most resilient cloud”) despite the failure.
    • Nameless, corporate voice instead of accountable leadership sign‑off.
  • Speculation that the customer received substantial credits or compensation; details remain unclear.