Incident Report: Railway Blocked by Google Cloud [resolved]

Incident and suspected cause

  • Railway’s outage was traced to its GCP account being placed in a “restricted” state; a GCP project was reportedly deleted without warning, taking CloudSQL, overflow VMs, and API access with it.
  • Railway reps say Google had assured them, after an earlier auto‑rate‑limit incident, that this would not recur. They add that restoration took minutes once a bug was filed, but the impact on customers lasted hours.
  • The exact trigger remains unclear in the thread (possibilities floated: abuse reports, payment issues, anti‑fraud/AI systems, or customer workloads), and several commenters stress that only one side of the story is visible.

Railway architecture and “not a cloud on a cloud”

  • Railway markets itself as owning its own metal and not “building a cloud on another cloud.”
  • Commenters discover that core databases and some networking still depended on GCP, which in their view contradicts that narrative.
  • Railway explains that they moved most compute to their own DCs (plus AWS), but left DBs on CloudSQL for HA/replication and to avoid a circular dependency on their own infra; in hindsight this became the critical single point of failure.
  • Some see this as an understandable technical tradeoff; others call it deceptive or dangerously backwards (the DB being the last thing to migrate).

Trust in GCP: bans and support

  • Many recount prior GCP suspensions (including smaller accounts and a Korean government org) and the UniSuper incident, in which a misconfiguration deleted an entire private cloud subscription.
  • General themes: aggressive automated enforcement, weak human support/CSM effectiveness, and fear that even high‑spend accounts can be “auto‑yeeted.”
  • A minority report good GCP relationships and argue such blow‑ups usually follow earlier warning signs or poor account hygiene.

Redundancy, multi‑cloud, and backups

  • Strong chorus: “all eggs in one basket” is risky, especially for critical control‑plane components like auth, DNS, and primary DBs.
  • Others counter that true multi‑cloud is extremely rare, complex, and often unjustified for startups; you typically “start with one egg.”
  • Several emphasize off‑provider backups and separate billing entities; the 3‑2‑1 backup rule is interpreted as “different accounts/providers,” not just extra buckets.
  • The discussion notes that shutting down an account or subscription can be a global single point of failure despite multi‑region setups.
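
The "different accounts/providers" reading of 3‑2‑1 can be sketched as a small check. This is only an illustration of the idea raised in the thread, not anything Railway or commenters posted: the `BackupCopy` type, the `check_3_2_1` function, and the provider/account labels are all hypothetical, and the sketch assumes that "offsite" means "outside the primary's provider and billing account."

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    """One copy of the data (the primary counts as a copy)."""
    name: str
    provider: str  # hypothetical labels, e.g. "gcp", "aws", "colo"
    account: str   # billing account or subscription that owns the copy
    medium: str    # e.g. "managed-db", "object-storage", "disk"


def check_3_2_1(copies):
    """Return a list of problems; an empty list means the check passes."""
    problems = []
    if len(copies) < 3:
        problems.append("fewer than 3 copies")
    if len({c.medium for c in copies}) < 2:
        problems.append("fewer than 2 storage media")
    # Assumption: a provider or account suspension takes out every copy it owns,
    # regardless of how many regions or buckets those copies span.
    if len({c.provider for c in copies}) < 2:
        problems.append("all copies depend on one provider")
    if len({(c.provider, c.account) for c in copies}) < 2:
        problems.append("all copies depend on one billing account")
    return problems


# Three copies across two media, but all owned by one account at one provider.
same_account = [
    BackupCopy("primary", "gcp", "acct-1", "managed-db"),
    BackupCopy("replica", "gcp", "acct-1", "managed-db"),
    BackupCopy("bucket-dump", "gcp", "acct-1", "object-storage"),
]
print(check_3_2_1(same_account))
```

The example layout passes the naive count (three copies, two media) yet is still flagged twice, because every copy disappears if that single account is restricted. That is the "not just extra buckets" point in code form.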

Comparisons to AWS/Azure and other hosts

  • Many say they’ve never seen AWS/Azure silently nuke accounts at this scale; AWS is criticized for regional outages (especially us‑east‑1) but praised for warnings and softer enforcement.
  • Some note other providers (Hetzner, OVH) are also aggressive on KYC/abuse; AWS/Azure are framed as the safer outliers for account risk.
  • Alternatives floated: Render, Vercel, Fly.io, DigitalOcean, Hetzner, Coolify, self‑hosted/colo, even rsync‑style offsite storage.

User impact and reactions

  • Hobbyists and small customers experienced long downtime, invalid TLS certs, and 502s; some had to manually redeploy even after Railway marked the incident “resolved.”
  • New and existing customers describe this as a “wake‑up call”; several immediately migrated to other platforms, saying trust is broken.
  • Others express continued sympathy for Railway but resolve not to run serious businesses on such a young platform.

Abuse, anti‑fraud, and free tiers

  • Some operators complain about heavy spam/abuse from Railway IPs and say its abuse prevention is weak.
  • Railway previously acknowledged internal anti‑fraud misfires that “hard killed” legitimate workloads.
  • Broader debate: free/cheap compute inevitably attracts abuse; strict KYC and anti‑fraud reduce that but hurt growth and UX.