Zero-Downtime Kubernetes Deployments on AWS with EKS

Kubernetes adoption, cost, and on‑prem vs cloud

  • Several comments argue many companies use Kubernetes without real need, adding complexity and blocking later moves to bare metal/on‑prem.
  • Others say Kubernetes on bare metal (e.g., k3s, Talos) works very well and is far cheaper than cloud instances for the same resources.
  • One pragmatic argument for Kubernetes: Helm charts and Kubernetes have become the de‑facto standard for on‑prem packaging, so deviating often incurs extra vendor effort and cost.

Cloud‑specific Kubernetes and migration pain

  • Managed cloud K8s (EKS, GKE, etc.) often relies on provider‑specific storage classes, load balancers, and certificate management.
  • When moving on‑prem, teams must re‑implement or replace these integrations (e.g., swapping provider storage classes for object‑storage‑backed PVCs or a Longhorn layout), which is non‑trivial.

Stateful services and databases

  • Some advocate running Postgres/Redis/RabbitMQ inside the cluster for tighter control and GitOps-only management.
  • Others prefer cloud‑managed services (RDS, ElastiCache, SQS) to avoid complex operators and fragile PV/PVC setups; PV resizing and HA are recurring pain points.

EKS/ALB zero‑downtime specifics

  • The article’s main issue: with the AWS Load Balancer Controller, there is a lag between Kubernetes marking a pod unready and the ALB actually deregistering the target, so requests routed during that window fail with 502s on rollout.
  • Pod Readiness Gates help but do not fully close the timing gap; the controller must also talk to AWS APIs, adding delay.
  • Some question whether the article’s approach is necessary at all, but others report seeing the same 502s in their ELB access logs and share the wariness.

Graceful shutdown and “lame duck” patterns

  • Common pattern: preStop hook triggers “lame duck” mode (fail readiness, keep serving existing requests, close keep‑alive connections), then terminate.
  • Several describe using a dedicated signal (e.g., SIGUSR1) before SIGTERM so the app can stop being “ready” without being killed.
  • Newer Kubernetes (1.29+) supports lifecycle.preStop.sleep.seconds, removing the need to ship an explicit sleep binary in the container image.
  • There is concern that frameworks and typical SIGTERM handlers rarely get this right by default.
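The lame‑duck pattern described above can be sketched as a minimal HTTP service. This is an illustrative sketch, not code from the article: the port, the 15‑second drain delay, and the handler names are placeholder assumptions that would need tuning to the readiness‑probe period and the load balancer's deregistration lag.

```python
import signal
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

draining = threading.Event()  # set => lame-duck mode


class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # enable keep-alive so draining matters

    def do_GET(self):
        if self.path == "/healthz":
            # Fail the readiness probe while draining so the load
            # balancer stops routing new traffic to this pod.
            self.send_response(503 if draining.is_set() else 200)
            self.send_header("Content-Length", "0")
            self.end_headers()
            return
        body = b"hello"
        self.send_response(200)
        if draining.is_set():
            # Close keep-alive connections so clients reconnect
            # to another pod before this one goes away.
            self.send_header("Connection", "close")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def drain_and_stop(server, lb_lag=15.0):
    """Enter lame-duck mode, wait for the LB to deregister us, then stop."""
    draining.set()      # readiness probe now fails; existing requests continue
    time.sleep(lb_lag)  # placeholder: tune to probe period + ALB deregistration lag
    server.shutdown()   # stop accepting; in-flight handlers finish


def main():
    server = ThreadingHTTPServer(("", 8080), Handler)
    # SIGTERM arrives from the kubelet after any preStop hook completes.
    signal.signal(
        signal.SIGTERM,
        lambda *_: threading.Thread(target=drain_and_stop, args=(server,)).start(),
    )
    server.serve_forever()
```

A dedicated pre‑SIGTERM signal (as some commenters describe with SIGUSR1) would simply call draining.set() without starting the shutdown timer.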

Complexity, design warts, and workarounds

  • Critics call it a “travesty” that the state‑of‑the‑art orchestrator needs explicit sleeps to avoid dropped traffic, contrasting with older two‑phase deployment systems.
  • Others respond that orchestration and LB integration are hard and that Kubernetes gives strong health‑check and observability primitives, though with sharp edges.
  • Annotation‑driven configuration (e.g., for AWS integrations) is widely disliked as “magic” and hard to validate.
  • ConfigMap behaviour is cited as surprising: subPath mounts never see updates, and pods do not restart when config changes, so teams resort to hash annotations or tools like Kustomize.
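The hash‑annotation workaround mentioned above boils down to embedding a digest of the config in the pod template, so that any config change alters the template and triggers an ordinary rolling update. A minimal sketch (the function name and canonicalization scheme are illustrative assumptions, not from the discussion):

```python
import hashlib
import json


def config_checksum(config_data: dict) -> str:
    """Return a stable digest of a ConfigMap's data.

    Embedding this value as a pod-template annotation (e.g. a
    hypothetical "checksum/config" key) makes any config change
    modify the template, which rolls the Deployment normally.
    """
    # Canonicalize so key order cannot change the hash.
    canonical = json.dumps(config_data, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Helm users typically get the same effect with sha256sum over the rendered ConfigMap template; Kustomize's configMapGenerator instead appends a content hash to the ConfigMap's name.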

Alternatives and higher‑level platforms

  • AWS ECS and Google Cloud Run are praised as “it just works” solutions for many stateless workloads, with far less operational overhead than Kubernetes.
  • Some who moved from K8s to ECS are happy; others miss Kubernetes’ richer UX and ecosystem.
  • ingress-nginx or service meshes (e.g., Istio) can handle graceful draining better than an ALB alone, at the cost of more moving parts.
  • Tools like Argo Rollouts, Porter, and custom CRDs are used to hide some of the underlying complexity.

GitOps vs IaC and AWS integration

  • One camp claims an all‑in‑cluster GitOps setup (Flux/Kustomize) reduces “IaC overhead”; others counter that GitOps is IaC, and you still need Terraform/CDK for clusters and external services.
  • AWS Controllers for Kubernetes (ACK) and pod identities are mentioned as ways to manage AWS resources (e.g., S3) via Kubernetes CRDs instead of Terraform.

Open issues

  • Handling very long‑lived client connections during rollout is raised as an unsolved problem; it’s unclear from the discussion whether existing “lame duck” strategies scale to connections lasting minutes or hours.