Making EC2 boot time faster

Why EC2 boot is slow (EBS, snapshots, hydration)

  • Main bottleneck: creating and “hydrating” an EBS root volume from a snapshot stored in S3.
  • Blocks are fetched lazily, so the first read of each block is slow; later boots from the same volume are much faster because the blocks are already present.
  • EBS doesn’t support true read‑only or copy‑on‑write roots, nor booting directly from local NVMe, so every instance needs its own root volume and pays the hydration cost again.
  • Some report strongly bimodal boot times (tens of seconds vs ~80s); the cause is unclear but is likely capacity or scheduling related.
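
The lazy loading described above means a block only becomes fast after it has been read once. A minimal Python sketch of forcing that first read for every block, assuming the volume is visible as an ordinary block device and that one sequential pass is good enough. The function name touch_all_blocks and the 1 MiB chunk size are illustrative assumptions, not anything from the discussion.

```python
def touch_all_blocks(path: str, chunk_size: int = 1024 * 1024) -> int:
    """Read `path` sequentially from start to end and return the bytes read.

    Reading every block once forces a lazily loaded volume to fetch it.
    The 1 MiB default chunk size is an illustrative assumption, not a tuned value.
    """
    total = 0
    # buffering=0 gives a raw file object, so Python adds no buffer of its own.
    with open(path, "rb", buffering=0) as dev:
        while True:
            chunk = dev.read(chunk_size)
            if not chunk:  # empty bytes object signals end of device/file
                break
            total += len(chunk)
    return total
```

On a real instance this would run as root against the attached device, for example touch_all_blocks("/dev/nvme1n1"); that device path is an assumption and varies by instance type. Tools such as dd or fio do the same job, and fio can issue the reads in parallel.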

Workarounds and optimizations on AWS

  • Strategy in the article: boot once, stop the instance, then later start it from the already‑hydrated EBS volume.
  • Warm Pools exist but are reported to react too slowly (60s+ to notice a scale‑up) and to have feature limitations.
  • Minimizing image size and using sparse EBS volumes can reduce hydration time.
  • Fast Snapshot Restore, or pre‑hydrating the volume by reading every block with fio, helps, but both are expensive and still slower than desired.
  • Some note Amazon Linux 2023 boots faster than Ubuntu.
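
The boot‑once, stop, start‑later strategy can be sketched in Python roughly as follows. This is a sketch under assumptions, not the article's implementation: the helper names prewarm_and_park and resume are hypothetical, and ec2 is expected to be a boto3 EC2 client (boto3.client("ec2")) passed in by the caller. The calls used (stop_instances, start_instances, and the instance_running / instance_stopped waiters) are standard boto3 EC2 client operations.

```python
from typing import Any


def prewarm_and_park(ec2: Any, instance_id: str) -> None:
    """Wait for a freshly launched instance to reach 'running', then stop it.

    Assumption: by the time the caller invokes this, the first boot has read
    the blocks it needs. The 'instance_running' waiter only reflects EC2
    instance state, not whether the OS has finished booting.
    """
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])


def resume(ec2: Any, instance_id: str) -> None:
    """Start a parked instance and block until EC2 reports it as running."""
    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
```

Passing the client in, rather than creating it inside the functions, keeps the sketch independent of credentials and region and lets it be exercised with a stub. A stopped instance is not billed for compute, but its EBS volume still is.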

Alternative architectures: S3, immutable roots, ephemeral storage

  • Several propose tiny boot AMIs that fetch a squashfs/root image from S3 and run from RAM or overlayfs, using local NVMe for scratch.
  • s3fs is called out as unreliable in production; rclone mount is suggested instead.
  • Others argue this essentially re‑implements EBS/AMIs, while proponents say what’s missing is cheap read‑only shared roots with CoW overlays.
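
To make the read‑only root plus copy‑on‑write overlay idea concrete, here is a small Python sketch that builds the two mount invocations involved. It assumes the squashfs image has already been downloaded (the S3 fetch is left out), and every path in the usage note is hypothetical. The function name overlay_root_commands is also illustrative. It only returns argument lists; running them would need root, for example via subprocess.run.

```python
from typing import List


def overlay_root_commands(
    image: str, lower: str, upper: str, work: str, merged: str
) -> List[List[str]]:
    """Return argv lists that mount a squashfs image read-only and put a
    writable overlayfs layer on top of it.

    overlayfs requires `upper` and `work` to live on the same filesystem,
    e.g. both on the instance's local NVMe scratch disk.
    """
    overlay_opts = f"lowerdir={lower},upperdir={upper},workdir={work}"
    return [
        # Read-only loop mount of the shared root image.
        ["mount", "-t", "squashfs", "-o", "loop,ro", image, lower],
        # Writable view: changes land in `upper`, the image stays untouched.
        ["mount", "-t", "overlay", "overlay", "-o", overlay_opts, merged],
    ]
```

For example, overlay_root_commands("/run/root.squashfs", "/run/lower", "/mnt/nvme/upper", "/mnt/nvme/work", "/run/newroot") yields one command per layer. Returning argument lists instead of shell strings avoids quoting problems if a path contains spaces.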

CI, GitHub Actions, and latency sources

  • Even with fast EC2 boot, GitHub’s own runner startup and job assignment can add ~8–10s, partly negating gains.
  • Some vendors use scale, warm pools, or alternative stacks (Firecracker, unikernels) to get sub‑second to tens‑of‑ms cold starts for CI‑like workloads.
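
The point about runner startup partly negating boot gains is simple arithmetic: a fixed overhead caps the end‑to‑end speedup. A minimal Python sketch, where the function name and the 40s and 5s boot times are hypothetical and the 9s overhead is just the middle of the ~8–10s range quoted above:

```python
def end_to_end_speedup(
    boot_before_s: float, boot_after_s: float, fixed_overhead_s: float
) -> float:
    """Speedup of (boot + fixed overhead) when only the boot part gets faster."""
    return (boot_before_s + fixed_overhead_s) / (boot_after_s + fixed_overhead_s)


# Hypothetical numbers: boot drops from 40s to 5s (8x), runner overhead stays 9s.
print(end_to_end_speedup(40, 5, 9))  # prints 3.5
```

Under these assumed numbers an 8x faster boot gives only a 3.5x faster time to first job step, because the 9s of runner startup is untouched.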

Autoscaling and prediction window

  • Shorter boot times shrink the prediction horizon: capacity only has to be forecast as far ahead as a new instance takes to become ready, which makes autoscaling more accurate and cost‑efficient.
  • Debate on typical timescales: many workloads can live with 30–60s; others (spiky CI, interactive environments) need much faster reaction.
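
One way to see why boot time drives autoscaling cost is a deliberately simple model, sketched in Python below. It assumes demand grows linearly at a known rate, which real workloads rarely do; the function name and the example figures are hypothetical. Under that assumption, the spare capacity needed to ride out a ramp equals the growth rate multiplied by the boot time.

```python
import math


def required_headroom(growth_per_s: float, boot_s: float) -> int:
    """Spare instances needed so demand growing at `growth_per_s`
    (instances per second) is covered while replacements take `boot_s` to boot.

    Assumes linear growth; rounds up because instances are whole units.
    """
    return math.ceil(growth_per_s * boot_s)


# Hypothetical ramp of one extra instance every two seconds.
print(required_headroom(0.5, 60))  # prints 30
print(required_headroom(0.5, 10))  # prints 5
```

In this toy model, cutting boot time from 60s to 10s reduces the idle headroom from 30 instances to 5 for the same ramp, which is the cost‑efficiency argument in miniature.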

Cloud vs bare metal / other providers

  • Some argue a single powerful bare‑metal box (or smaller clouds like Hetzner) can give much faster, simpler builds.
  • Others stress that large clouds optimize for multi‑tenant isolation and networked storage, trading off raw latency for flexibility and durability.