Zero-Downtime Kubernetes Deployments on AWS with EKS
Kubernetes adoption, cost, and on‑prem vs cloud
- Several comments argue many companies use Kubernetes without real need, adding complexity and blocking later moves to bare metal/on‑prem.
- Others say Kubernetes on bare metal (e.g. k3s, Talos) works very well and is far cheaper than cloud instances for the same resources.
- One pragmatic argument for Kubernetes: Helm charts and Kubernetes have become the de‑facto standard for on‑prem packaging, so deviating often incurs extra vendor effort and cost.
Cloud‑specific Kubernetes and migration pain
- Managed cloud K8s (EKS, GKE, etc.) often relies on provider‑specific storage classes, load balancers, and certificate management.
- When moving on‑prem, teams must re‑implement or replace these integrations (e.g., re‑homing object‑storage‑backed PVCs onto something like Longhorn), which is non‑trivial.
Stateful services and databases
- Some advocate running Postgres/Redis/RabbitMQ inside the cluster for tighter control and GitOps-only management.
- Others prefer cloud‑managed services (RDS, ElastiCache, SQS) to avoid complex operators and fragile PV/PVC setups; PV resizing and HA are recurring pain points.
EKS/ALB zero‑downtime specifics
- The article’s main issue: with the AWS Load Balancer Controller and an ALB, there is a lag between a pod being marked unready (or terminating) and the ALB actually stopping traffic to it, leading to 502s during rollouts.
- Pod Readiness Gates help but do not fully close the timing gap; the controller must also talk to AWS APIs, adding delay.
- Some question whether the article’s approach is necessary at all, but others report hitting the same errors and describe nervously auditing their ELB access logs for them.
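The common mitigations for this timing gap can be sketched as Kubernetes manifests. This is a sketch, not the article’s exact configuration: the namespace label follows the AWS Load Balancer Controller’s documented readiness‑gate injection convention, and `my-app`, the image, and the sleep duration are placeholders.

```yaml
# 1) Opt the namespace into Pod Readiness Gate injection, so a new pod
#    only counts as Ready once its ALB target group target is healthy:
apiVersion: v1
kind: Namespace
metadata:
  name: my-app                # hypothetical namespace
  labels:
    elbv2.k8s.aws/pod-readiness-gate-inject: enabled
---
# 2) On shutdown, keep the old pod serving while the controller talks to
#    the AWS APIs and the ALB deregisters the target. The preStop sleep
#    should cover that round-trip plus the deregistration delay.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: my-app
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60   # must exceed the preStop sleep
      containers:
        - name: app
          image: my-app:latest            # hypothetical image
          lifecycle:
            preStop:
              exec:
                command: ["sleep", "30"]  # tune to your deregistration delay
```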
Graceful shutdown and “lame duck” patterns
- Common pattern: preStop hook triggers “lame duck” mode (fail readiness, keep serving existing requests, close keep‑alive connections), then terminate.
- Several describe using a dedicated signal (e.g., SIGUSR1) before SIGTERM so the app can stop being “ready” without being killed.
- Newer Kubernetes supports `lifecycle.preStop.sleep.seconds`, removing the need for an explicit `sleep` binary.
- There is concern that frameworks and typical SIGTERM handlers rarely get this right by default.
Complexity, design warts, and workarounds
- Critics call it a “travesty” that the state‑of‑the‑art orchestrator needs explicit sleeps to avoid dropped traffic, contrasting with older two‑phase deployment systems.
- Others respond that orchestration and LB integration are hard and that Kubernetes gives strong health‑check and observability primitives, though with sharp edges.
- Annotation‑driven configuration (e.g., for AWS integrations) is widely disliked as “magic” and hard to validate.
- Surprising behaviour is cited around ConfigMaps: `subPath`‑mounted ConfigMaps are not updated in place, and pods do not restart when their config changes, so teams resort to hash annotations or tools like Kustomize.
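The hash‑annotation workaround is commonly done with Helm templating. A sketch (the chart file path `configmap.yaml` is hypothetical): the pod‑template annotation changes whenever the rendered ConfigMap changes, which makes the Deployment roll its pods on config updates.

```yaml
# Fragment of a Helm-templated Deployment spec:
spec:
  template:
    metadata:
      annotations:
        checksum/config: '{{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}'
```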
Alternatives and higher‑level platforms
- AWS ECS and Google Cloud Run are praised as “it just works” solutions for many stateless workloads, with far less operational overhead than Kubernetes.
- Some who moved from K8s to ECS are happy; others miss Kubernetes’ richer UX and ecosystem.
- Ingress‑Nginx or service meshes (Istio) can handle graceful draining better than ALB alone, at the cost of more components.
- Tools like Argo Rollouts, Porter, and custom CRDs are used to hide some of the underlying complexity.
GitOps vs IaC and AWS integration
- One camp claims an all‑in‑cluster GitOps setup (Flux/Kustomize) reduces “IaC overhead”; others counter that GitOps is IaC, and you still need Terraform/CDK for clusters and external services.
- AWS Controllers for Kubernetes (ACK) and pod identities are mentioned as ways to manage AWS resources (e.g., S3) via Kubernetes CRDs instead of Terraform.
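As a sketch of the ACK approach, an S3 bucket can be declared as a Kubernetes resource and reconciled by the ACK S3 controller instead of Terraform. The API group/version follows the ACK S3 controller’s published CRDs; the names are placeholders.

```yaml
apiVersion: s3.services.k8s.aws/v1alpha1
kind: Bucket
metadata:
  name: my-app-assets            # hypothetical Kubernetes resource name
spec:
  name: my-app-assets-bucket     # actual S3 bucket name (must be globally unique)
```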
Open issues
- Handling very long‑lived client connections during rollout is raised as an unsolved problem; it’s unclear from the discussion whether existing “lame duck” strategies scale to connections lasting minutes or hours.