Zero-Downtime Kubernetes Deployments on AWS with EKS
Kubernetes adoption, cost, and on‑prem vs cloud
- Several comments argue many companies use Kubernetes without real need, adding complexity and blocking later moves to bare metal/on‑prem.
- Others say Kubernetes on bare metal (e.g. k3s, Talos) works very well and is far cheaper than cloud instances for the same resources.
- One pragmatic argument for Kubernetes: Helm charts and Kubernetes have become the de‑facto standard for on‑prem packaging, so deviating often incurs extra vendor effort and cost.
Cloud‑specific Kubernetes and migration pain
- Managed cloud K8s (EKS, GKE, etc.) often relies on provider‑specific storage classes, load balancers, and certificate management.
- When moving on‑prem, teams must re‑implement or replace these integrations (e.g., re‑homing object‑storage‑backed PVCs onto something like Longhorn), which is non‑trivial.
Stateful services and databases
- Some advocate running Postgres/Redis/RabbitMQ inside the cluster for tighter control and GitOps-only management.
- Others prefer cloud‑managed services (RDS, ElastiCache, SQS) to avoid complex operators and fragile PV/PVC setups; PV resizing and HA are recurring pain points.
EKS/ALB zero‑downtime specifics
- The article’s main issue: with the AWS Load Balancer Controller and an ALB, there is a lag between a pod being marked unready (or terminating) and the ALB actually stopping traffic to it, leading to 502s during rollouts.
- Pod Readiness Gates help but do not fully close the timing gap; the controller must also talk to AWS APIs, adding delay.
- Some question whether the article’s approach is necessary at all, but others report hitting the same errors and describe nervously auditing their ELB access logs for them.
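The common mitigations for this timing gap can be sketched as Kubernetes manifests. This is a sketch, not the article’s exact configuration: the namespace label follows the AWS Load Balancer Controller’s documented readiness‑gate injection convention, and `my-app`, the image, and the sleep duration are placeholders.

```yaml
# 1) Opt the namespace into Pod Readiness Gate injection, so a new pod
#    only counts as Ready once its ALB target group target is healthy:
apiVersion: v1
kind: Namespace
metadata:
  name: my-app                # hypothetical namespace
  labels:
    elbv2.k8s.aws/pod-readiness-gate-inject: enabled
---
# 2) On shutdown, keep the old pod serving while the controller talks to
#    the AWS APIs and the ALB deregisters the target. The preStop sleep
#    should cover that round-trip plus the deregistration delay.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: my-app
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60   # must exceed the preStop sleep
      containers:
        - name: app
          image: my-app:latest            # hypothetical image
          lifecycle:
            preStop:
              exec:
                command: ["sleep", "30"]  # tune to your deregistration delay
```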
Graceful shutdown and “lame duck” patterns
- Common pattern: preStop hook triggers “lame duck” mode (fail readiness, keep serving existing requests, close keep‑alive connections), then terminate.
- Several describe using a dedicated signal (e.g., SIGUSR1) before SIGTERM so the app can stop being “ready” without being killed.
- Newer Kubernetes supports `lifecycle.preStop.sleep.seconds`, removing the need for an explicit `sleep` binary.
- There is concern that frameworks and typical SIGTERM handlers rarely get this right by default.
Complexity, design warts, and workarounds
- Critics call it a “travesty” that the state‑of‑the‑art orchestrator needs explicit sleeps to avoid dropped traffic, contrasting with older two‑phase deployment systems.
- Others respond that orchestration and LB integration are hard and that Kubernetes gives strong health‑check and observability primitives, though with sharp edges.
- Annotation‑driven configuration (e.g., for AWS integrations) is widely disliked as “magic” and hard to validate.
- Surprising behaviour is cited around ConfigMaps: `subPath`‑mounted ConfigMaps are not updated in place, and pods do not restart when their config changes, so teams resort to hash annotations or tools like Kustomize.
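The hash‑annotation workaround is commonly done with Helm templating. A sketch (the chart file path `configmap.yaml` is hypothetical): the pod‑template annotation changes whenever the rendered ConfigMap changes, which makes the Deployment roll its pods on config updates.

```yaml
# Fragment of a Helm-templated Deployment spec:
spec:
  template:
    metadata:
      annotations:
        checksum/config: '{{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}'
```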
Alternatives and higher‑level platforms
- AWS ECS and Google Cloud Run are praised as “it just works” solutions for many stateless workloads, with far less operational overhead than Kubernetes.
- Some who moved from K8s to ECS are happy; others miss Kubernetes’ richer UX and ecosystem.
- Ingress‑Nginx or service meshes (Istio) can handle graceful draining better than ALB alone, at the cost of more components.
- Tools like Argo Rollouts, Porter, and custom CRDs are used to hide some of the underlying complexity.
GitOps vs IaC and AWS integration
- One camp claims an all‑in‑cluster GitOps setup (Flux/Kustomize) reduces “IaC overhead”; others counter that GitOps is IaC, and you still need Terraform/CDK for clusters and external services.
- AWS Controllers for Kubernetes (ACK) and pod identities are mentioned as ways to manage AWS resources (e.g., S3) via Kubernetes CRDs instead of Terraform.
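As a sketch of the ACK approach, an S3 bucket can be declared as a Kubernetes resource and reconciled by the ACK S3 controller instead of Terraform. The API group/version follows the ACK S3 controller’s published CRDs; the names are placeholders.

```yaml
apiVersion: s3.services.k8s.aws/v1alpha1
kind: Bucket
metadata:
  name: my-app-assets            # hypothetical Kubernetes resource name
spec:
  name: my-app-assets-bucket     # actual S3 bucket name (must be globally unique)
```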
Open issues
- Handling very long‑lived client connections during rollout is raised as an unsolved problem; it’s unclear from the discussion whether existing “lame duck” strategies scale to connections lasting minutes or hours.