We moved from AWS to Hetzner, saved 90%, kept ISO 27001 with Ansible

Cost, Scope, and When Migration Makes Sense

  • Reported savings: ~90% vs AWS, from ~$24k/year to ~$2.4k/year, for a modest but real production workload (10–20k DAU, ~1.5–2k peak concurrent users).
  • Some argue that if the whole company runs on ~$200/month infra, AWS was overkill or adopted too early; others note that for bootstrapped EU companies $20k/year is very material.
  • Several commenters have seen similar 5–10× savings moving from AWS/Azure to Hetzner/DO/VPS setups, especially when replacing RDS and cleaning up unused or forgotten resources.

DIY vs Managed Cloud and Operational Burden

  • Critics highlight hidden costs: time to rebuild AWS features (RDS, IAM, monitoring, DR), 24/7 responsibilities, and long‑term maintenance complexity.
  • OP claims infra effort stayed at ~0.1 FTE both before and after, thanks to heavy use of Terraform, Ansible, and automated monitoring/alerting (see the alert‑rule sketch after this list); the migration itself took ~0.5 FTE for a few months.
  • Some say AWS doesn’t truly give 24/7 app support, just infra SLAs—you still need in‑house expertise and cost control. Others counter that services like RDS, SQS, S3, IAM, ECS, IoT, etc. meaningfully reduce cognitive and operational load.
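
What “automated monitoring/alerting” looks like in practice can be pictured as a small set of Prometheus alert rules shipped by configuration management. A minimal sketch, assuming node_exporter metrics; rule names, thresholds, and severity labels are illustrative, not from the article:

```yaml
# prometheus/rules/basic.yml: illustrative alert rules, not taken from the article.
# Thresholds, names, and severity labels are assumptions for the sketch.
groups:
  - name: host-basics
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.instance }} has been unreachable for 5 minutes"
      - alert: DiskAlmostFull
        # node_exporter metric; fires when less than 10% of the root filesystem is free
        expr: 'node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.10'
        for: 15m
        labels:
          severity: warn
        annotations:
          summary: "Root filesystem on {{ $labels.instance }} is over 90% full"
```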

Compliance, Sovereignty, and ISO 27001

  • In EU contexts, ISO 27001 is often a hard requirement; several commenters describe detailed mappings from the Terraform/Ansible setup to ISO controls (asset inventory, hardening, logging, DR, crypto, network controls), along the lines of the sketch after this list.
  • OP emphasizes that the main driver was EU data sovereignty and client distrust of US hyperscalers (CLOUD Act, Schrems II/III, Safe Harbor uncertainty); cost savings made the move easier to justify.
  • AWS’s planned “European Sovereign Cloud” is widely viewed as insufficient for true political/legal independence, though it may tick checkbox‑compliance for some.
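
One way to read the “Terraform/Ansible to ISO controls” mapping is that each hardening requirement becomes a tagged Ansible task, so the playbook itself doubles as evidence for the control. A minimal sketch assuming Ubuntu hosts; the tag names and the choice of sshd/auditd as examples are assumptions, not the article’s actual control catalogue:

```yaml
# hardening.yml: illustrative Ansible tasks; the ISO control names used as tags
# are assumptions, not the article's actual control catalogue.
- name: Baseline hardening mapped to ISO 27001 controls
  hosts: all
  become: true
  tasks:
    - name: Disable SSH root login (access control)
      ansible.builtin.lineinfile:
        path: /etc/ssh/sshd_config
        regexp: '^#?PermitRootLogin'
        line: 'PermitRootLogin no'
        validate: '/usr/sbin/sshd -t -f %s'
      notify: Restart sshd
      tags: [iso_access_control]

    - name: Ensure auditd is installed (logging and monitoring)
      ansible.builtin.apt:
        name: auditd
        state: present
      tags: [iso_logging]

    - name: Ensure auditd is enabled and running (logging and monitoring)
      ansible.builtin.service:
        name: auditd
        state: started
        enabled: true
      tags: [iso_logging]

  handlers:
    - name: Restart sshd
      ansible.builtin.service:
        name: ssh
        state: restarted
```

Running a playbook like this with `ansible-playbook --tags iso_logging` then gives a repeatable, auditable answer to “show me how this control is implemented.”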

Hetzner/OVH Reliability and Risk Mitigation

  • Concerns raised: “dirty” IP reputation, Sybil/spam abuse, sudden account terminations or takedowns at Hetzner, long OVH outages, and slow, banker’s‑hours support.
  • OP mitigates via: Cloudflare fronting all public traffic (IP allowlisting + ufw; sketched after this list), multi‑cloud design (Hetzner + OVH), encrypted multi‑provider backups, and tested DB failover to a hot standby at another provider.
  • Some view these providers as fine for cost‑sensitive workloads but risky as a single point of failure; recommendation is at least cross‑provider backups or active replication.
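
The “Cloudflare fronting + IP allowlisting + ufw” mitigation amounts to a default‑deny firewall plus rules that only admit Cloudflare’s published ranges on the public HTTPS port. A minimal Ansible sketch; the two ranges shown are a small illustrative subset of Cloudflare’s list and would normally be pulled from the current published list rather than hard‑coded:

```yaml
# cloudflare-allowlist.yml: illustrative; the two ranges below are a small subset of
# Cloudflare's published list and would normally be templated, not hard-coded.
- name: Only allow Cloudflare to reach the public HTTPS port
  hosts: web
  become: true
  vars:
    cloudflare_ranges:
      - 173.245.48.0/20
      - 103.21.244.0/22
  tasks:
    - name: Default-deny incoming traffic and enable ufw
      community.general.ufw:
        state: enabled
        policy: deny
        direction: incoming

    - name: Allow HTTPS only from Cloudflare ranges
      community.general.ufw:
        rule: allow
        port: "443"
        proto: tcp
        from_ip: "{{ item }}"
      loop: "{{ cloudflare_ranges }}"
```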

Architecture and Tooling Choices

  • Stack: Ubuntu VMs, Spring Boot apps, Postgres + streaming replica on another cloud, Redis, Prometheus + Alertmanager, Grafana Agent, Loki, rsyslog/auditd, ufw, chrony, Cloudflare WAF/LB, Certbot for TLS, all codified via Terraform + Ansible.
  • ISO constraints drive separation of concerns: separate monitoring/logging servers, non‑public SSH, no root login, controlled sudo, strict firewalling, encrypted backups, and explicit upgrade/rollback procedures.
  • DB upgrades and DR rely on “replace-with-new-node” patterns and failover promotion of the streaming replica rather than managed RDS (see the sketch after this list).
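
The failover‑promotion half of that pattern reduces to promoting the streaming replica once the primary is declared dead. A minimal Ansible sketch, assuming Ubuntu’s pg_ctlcluster wrapper and a PostgreSQL 16 cluster named “main”; version, cluster name, and host group are assumptions, since the article doesn’t specify them:

```yaml
# promote-standby.yml: illustrative failover step; the PostgreSQL version, cluster
# name, and host group are assumptions, not taken from the article.
- name: Promote the streaming replica to primary
  hosts: db_standby
  become: true
  tasks:
    - name: Check whether this node is still in recovery (i.e. a standby)
      ansible.builtin.command: psql -tAc "SELECT pg_is_in_recovery();"
      become_user: postgres
      register: recovery_state
      changed_when: false

    - name: Promote the cluster only if it is still a standby
      ansible.builtin.command: pg_ctlcluster 16 main promote
      when: (recovery_state.stdout | trim) == "t"
```

The “replace‑with‑new‑node” half is then, presumably, re‑running the same provisioning roles against a fresh VM and re‑attaching it as a replica before the next planned swap.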

Kubernetes, Monitoring, and Logging Debates

  • OP deliberately avoided Kubernetes on bare metal; previous EKS experience was described as overly complex for “two apps + DB + Redis,” with EBS/AZ quirks and autoscaling issues.
  • Some commenters report acceptable experiences with modern EKS (managed node groups, better addons), but still acknowledge its complexity and YAGNI risk for small stacks.
  • Loki is noted as memory‑hungry; mitigations include careful indexing and query limits (see the limits_config sketch after this list). Alternatives mentioned: VictoriaMetrics, Quickwit+Vector.
  • AWS CloudWatch is broadly criticized as slow, expensive, and clunky compared with Prometheus/Grafana/Loki; even simple features like “live tail” cost extra, which some see as misaligned with smaller‑scale needs.
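
The query‑limit mitigation for Loki typically lands in its limits_config block. A minimal sketch of the relevant knobs; the values are illustrative, and exact key names and defaults vary between Loki versions:

```yaml
# loki.yml excerpt: illustrative values; exact keys and defaults vary by Loki version.
limits_config:
  # Bound per-query fan-out and result size so large queries stay within memory.
  max_query_parallelism: 8
  split_queries_by_interval: 1h
  max_query_series: 5000
  # Bound ingestion so a noisy application cannot overwhelm a single node.
  ingestion_rate_mb: 8
  ingestion_burst_size_mb: 16
  # Refuse very old samples instead of holding long-tail streams in memory.
  reject_old_samples: true
  reject_old_samples_max_age: 168h
```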