Don't rent the cloud, own instead
Risk, reliability, and disaster planning
- Multiple commenters ask how a single in‑office data center handles disasters: fire, flooding, power failure, earthquakes.
- Past incidents (e.g., OVH fire, burst pipes) are cited to argue that “one DC” without geographic redundancy is inherently fragile; many say “you need at least two.”
- Some note comma’s workloads are offline training rather than user-facing, so weeks of downtime may be tolerable if offsite backups exist.
- Others question humidity and “outside air” cooling, pointing to ASHRAE guidelines and long‑term hardware damage from dust, static, and moisture (a minimal intake‑air check is sketched below).
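To make the cooling objection concrete, here is a minimal sketch of the kind of intake‑air gate a free‑cooling setup needs. The temperature and humidity ranges are assumptions that only approximate an ASHRAE‑style recommended envelope, not the published limits:

```python
# Minimal sketch of an intake-air gate for "free" outside-air cooling.
# The ranges below are assumptions approximating a conservative
# ASHRAE-style recommended envelope, not the published limits.

TEMP_RANGE_C = (18.0, 27.0)   # assumed allowable dry-bulb temperature
RH_RANGE_PCT = (20.0, 60.0)   # assumed allowable relative humidity

def intake_ok(temp_c: float, rh_pct: float) -> bool:
    """True if outside air could be used directly, without conditioning."""
    return (TEMP_RANGE_C[0] <= temp_c <= TEMP_RANGE_C[1]
            and RH_RANGE_PCT[0] <= rh_pct <= RH_RANGE_PCT[1])

# A humid day fails even at a comfortable temperature, which is the
# commenters' point: outside air often needs filtering and treatment anyway.
print(intake_ok(24.0, 85.0))  # False
print(intake_ok(22.0, 45.0))  # True
```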
Cloud vs on‑prem economics
- Repeated theme: at large, steady GPU/HPC scale, on‑prem is dramatically cheaper than hyperscale cloud (10–20× is mentioned; a back‑of‑envelope sketch follows this list).
- Counterpoint: risk‑adjusted and bureaucracy‑adjusted costs often favor opex cloud, especially for public sector and mid‑sized enterprises that struggle to get capex approved.
- Several note that cloud TCO calculators heavily overestimate on‑prem costs, assuming very high hardware prices and labor rates. Others argue many orgs undercount real on‑prem work (24/7 coverage, spares, security, audits).
- Capex vs opex is framed as partly accounting/political: recurring SaaS and cloud line items are often easier to approve than a big one‑time spend, regardless of pure math.
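A back‑of‑envelope model shows where the multiple comes from, though the ratio is very sensitive to the inputs. Every constant below is an illustrative assumption, not a figure from the thread or a real price quote:

```python
# Back-of-envelope: 3-year cost of one high-end GPU, rented vs owned.
# Every constant is an illustrative assumption, not a real quote.

CLOUD_PER_GPU_HOUR = 12.00   # assumed hyperscaler on-demand rate, USD
HOURS_PER_YEAR = 24 * 365
YEARS = 3                    # assumed depreciation horizon

GPU_PURCHASE = 30_000        # assumed GPU purchase price, USD
HOST_SHARE = 10_000          # assumed per-GPU share of server/network/rack, USD
POWER_KW_PER_GPU = 1.0       # assumed draw incl. cooling overhead
POWER_USD_PER_KWH = 0.12     # assumed electricity price
OPS_PER_GPU_YEAR = 2_000     # assumed per-GPU share of staff/colo/maintenance, USD

cloud_total = CLOUD_PER_GPU_HOUR * HOURS_PER_YEAR * YEARS

owned_total = (GPU_PURCHASE + HOST_SHARE
               + POWER_KW_PER_GPU * HOURS_PER_YEAR * YEARS * POWER_USD_PER_KWH
               + OPS_PER_GPU_YEAR * YEARS)

print(f"cloud, 3y per GPU: ${cloud_total:,.0f}")              # $315,360
print(f"owned, 3y per GPU: ${owned_total:,.0f}")              # $49,154
print(f"ratio:             {cloud_total / owned_total:.1f}x")  # 6.4x
```

Under these assumptions the ratio lands around 6×; the 10–20× figures quoted in the thread imply cheaper hardware, longer depreciation, or pricier cloud rates than assumed here.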
Colocation, bare metal, and “managed private cloud”
- Many suggest intermediate options: colocation with owned servers, rented dedicated servers (e.g., Hetzner/OVH), or third‑party “managed private cloud” on bare metal.
- These are described as giving 50–90% of the savings of full on‑prem with far less operational burden, especially if paired with Kubernetes or similar orchestration.
- Real‑world anecdotes cut both ways: multi‑rack colos saving millions versus cloud, but also colos in expensive cities whose pricing approaches the cloud’s.
Operational complexity and skills
- One camp insists running servers/colos is “not that hard” and that cloud operational work (APIs, managed services, outages) is comparably complex.
- The other camp highlights hidden work: 24/7 on‑call, hardware failures, backups, DB management, security hardening, audits, and the pain when senior infra people leave.
- Several point out that you don’t escape ops by using cloud—you just shift it from racking to managing complex cloud stacks and proprietary services.
Startups, scale, and lock‑in
- Common model described: start on cloud to validate the product; consider bare metal/colo/on‑prem only once infra spend reaches the “multiple FTEs per year” range (a break‑even sketch follows this list).
- Some warn that easy cloud onboarding plus proprietary managed services create lock‑in, making later migration very hard and expensive.
- For “compute‑native” companies (ML training, HPC), on‑prem or colo is seen as a core competency and a major competitive lever; for most SaaS or line‑of‑business apps, the risk of running a DC is viewed as unjustified.
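The “multiple FTEs per year” heuristic can be written down directly. The salary, headcount, and savings fraction below are assumptions; the savings figure sits mid‑range of the 50–90% quoted for colo above:

```python
# The "multiple FTEs per year" heuristic, written as a check.
# Salary, headcount, and savings fraction are all assumptions.

FTE_COST = 200_000        # assumed fully loaded infra engineer, USD/yr
ADDED_FTES = 2            # assumed extra headcount to run owned infra
SAVINGS_FRACTION = 0.60   # assumed cut of the cloud bill (mid-range of 50-90%)

def worth_evaluating(annual_cloud_spend: float) -> bool:
    """Leaving cloud pays off only if savings beat the added staff cost."""
    return annual_cloud_spend * SAVINGS_FRACTION > ADDED_FTES * FTE_COST

for spend in (100_000, 500_000, 1_000_000, 5_000_000):
    print(f"${spend:>9,}/yr cloud -> evaluate colo/on-prem: {worth_evaluating(spend)}")
```

With these inputs the crossover sits near $670k/yr of cloud spend, i.e. a few engineers’ worth, matching the thread’s rule of thumb.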
Engineering culture, incentives, and sovereignty
- Supporters of owning hardware stress: deeper technical skills, better optimization incentives when compute is fixed, and psychological benefits of control.
- Skeptics argue many orgs don’t have the talent or desire; they should focus on product, not “building their own Jira and their own data center.”
- EU commenters note sovereignty and US CLOUD Act concerns as an additional driver for on‑prem, EU clouds, or research HPC, especially for health/financial data.