Making EC2 boot time faster
Why EC2 boot is slow (EBS, snapshots, hydration)
- Main bottleneck: creating and “hydrating” an EBS root volume from a snapshot stored in S3.
- Blocks are fetched lazily, so the first read of each block is slow; later boots from the same, already-hydrated volume are much faster.
- EBS doesn't support true read-only or copy-on-write roots, nor booting directly from local NVMe, so every new instance pays the hydration cost for its own root volume.
- Some commenters report strongly bimodal boot times (tens of seconds vs. ~80s); the cause is unclear but is likely related to capacity or scheduling.
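The lazy-hydration behavior described above can be illustrated with a toy model. This is only a sketch: `LazyVolume`, `snapshot_block`, and the block counts are made-up names and numbers for illustration, not anything EBS exposes.

```python
from typing import Callable, Dict


class LazyVolume:
    """Toy model of a volume whose blocks are pulled from a snapshot on first read."""

    def __init__(self, num_blocks: int, fetch_block: Callable[[int], bytes]) -> None:
        self.num_blocks = num_blocks
        self._fetch_block = fetch_block      # stands in for the slow snapshot read
        self._local: Dict[int, bytes] = {}   # blocks that are already hydrated
        self.remote_fetches = 0

    def read(self, index: int) -> bytes:
        if not 0 <= index < self.num_blocks:
            raise IndexError(index)
        if index not in self._local:
            # First touch: pay the slow path once, then keep the block locally.
            self._local[index] = self._fetch_block(index)
            self.remote_fetches += 1
        return self._local[index]

    def prewarm(self) -> None:
        """Touch every block so that later reads never hit the slow path."""
        for index in range(self.num_blocks):
            self.read(index)


def snapshot_block(index: int) -> bytes:
    # Hypothetical snapshot contents; a real fetch would be a network read.
    return bytes([index % 256]) * 4


volume = LazyVolume(num_blocks=8, fetch_block=snapshot_block)
volume.read(3)
volume.read(3)                      # second read is served locally
first_boot_fetches = volume.remote_fetches
volume.prewarm()                    # hydrate the remaining blocks
after_prewarm_fetches = volume.remote_fetches
```

In this model a second boot on the same volume triggers no further remote fetches, which is the effect the stop-and-restart strategy relies on.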
Workarounds and optimizations on AWS
- Strategy in the article: boot once, stop the instance, then later start it from the already‑hydrated EBS volume.
- Warm Pools exist but are reported as too slow to react (60s+ to notice scale-up) and have feature limits.
- Minimizing image size and using sparse EBS volumes can reduce hydration time.
- Fast Snapshot Restore (FSR) and pre-hydrating a volume by reading every block (for example with fio) help, but are reported as expensive and still slower than desired.
- Some note Amazon Linux 2023 boots faster than Ubuntu.
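The boot-once, stop, start-later strategy could be sketched roughly as follows. This assumes `ec2` is a boto3 EC2 client (or anything with the same `stop_instances`, `start_instances`, and `get_waiter` methods); `StoppedPool` and the instance IDs are hypothetical and are not taken from the article.

```python
from typing import Any, List


def park_instance(ec2: Any, instance_id: str) -> None:
    """Stop an instance after its first boot so the hydrated root volume is kept."""
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])


def resume_instance(ec2: Any, instance_id: str) -> None:
    """Start a parked instance; it boots from the already-hydrated volume."""
    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])


class StoppedPool:
    """Hypothetical pool of instances that were booted once and then stopped."""

    def __init__(self, ec2: Any, stopped_ids: List[str]) -> None:
        self._ec2 = ec2
        self._stopped = list(stopped_ids)

    def acquire(self) -> str:
        if not self._stopped:
            raise RuntimeError("no pre-booted instances left in the pool")
        instance_id = self._stopped.pop()
        resume_instance(self._ec2, instance_id)
        return instance_id
```

With boto3 installed, `ec2` would typically be `boto3.client("ec2")`. Stopped instances still incur EBS storage charges, so the pool size is a cost trade-off.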
Alternative architectures: S3, immutable roots, ephemeral storage
- Several propose tiny boot AMIs that fetch a squashfs/root image from S3 and run from RAM or overlayfs, using local NVMe for scratch.
- s3fs is called out as unreliable in production; rclone mount is suggested instead.
- Others argue this essentially re‑implements EBS/AMIs, while proponents say what’s missing is cheap read‑only shared roots with CoW overlays.
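A minimal sketch of what the tiny-boot-AMI approach might run at startup, expressed as a list of commands rather than executed directly. The bucket URL, directory layout, and function name are assumptions for illustration; the sketch uses the AWS CLI for the download, and a real setup would run these steps from an initramfs or early boot unit and then pivot into the merged root.

```python
import shlex
from typing import List


def overlay_root_plan(image_url: str, base_dir: str = "/run/rootimg") -> List[List[str]]:
    """Return the commands that fetch a squashfs image and mount it under an overlay."""
    image = f"{base_dir}/root.squashfs"
    lower = f"{base_dir}/lower"    # read-only squashfs contents
    upper = f"{base_dir}/upper"    # writable layer (same filesystem as work)
    work = f"{base_dir}/work"      # overlayfs scratch directory
    merged = f"{base_dir}/merged"  # combined view that would become the root
    overlay_opts = f"lowerdir={lower},upperdir={upper},workdir={work}"
    return [
        ["mkdir", "-p", lower, upper, work, merged],
        ["aws", "s3", "cp", image_url, image],
        ["mount", "-t", "squashfs", "-o", "loop,ro", image, lower],
        ["mount", "-t", "overlay", "overlay", "-o", overlay_opts, merged],
    ]


if __name__ == "__main__":
    # Hypothetical bucket and key, printed as shell lines for inspection.
    for command in overlay_root_plan("s3://example-bucket/root.squashfs"):
        print(shlex.join(command))
```

Pointing `base_dir` at a tmpfs keeps the writable layer in RAM, while pointing it at a local NVMe mount uses instance storage for scratch, matching the two variants mentioned above.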
CI, GitHub Actions, and latency sources
- Even with fast EC2 boot, GitHub’s own runner startup and job assignment can add ~8–10s, partly negating gains.
- Some vendors use scale, warm pools, or alternative stacks (Firecracker, unikernels) to get cold starts ranging from tens of milliseconds to under a second for CI-like workloads.
Autoscaling and prediction window
- Shorter boot times shrink the prediction horizon, since an autoscaler only has to forecast demand as far ahead as an instance takes to boot; this makes autoscaling more accurate and cost-efficient.
- Commenters disagree on typical timescales: many workloads can live with 30–60s; others (spiky CI, interactive environments) need much faster reaction.
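The link between boot time and autoscaling accuracy can be made concrete with a small simulation. This is a sketch under simplifying assumptions: a purely reactive autoscaler with no forecasting and no scale-down, one-second time steps, and a made-up demand curve.

```python
from typing import List


def unmet_demand(demand: List[int], boot_seconds: int) -> int:
    """Instance-seconds of demand left unserved by a purely reactive autoscaler."""
    if boot_seconds < 1:
        raise ValueError("boot_seconds must be at least 1")
    arrivals = [0] * (len(demand) + boot_seconds + 1)
    ready = 0     # instances that have finished booting
    pending = 0   # instances launched but still booting
    unmet = 0
    for t, wanted in enumerate(demand):
        # Instances launched boot_seconds ago become usable now.
        ready += arrivals[t]
        pending -= arrivals[t]
        shortfall = wanted - (ready + pending)
        if shortfall > 0:
            arrivals[t + boot_seconds] += shortfall
            pending += shortfall
        unmet += max(0, wanted - ready)
    return unmet


# Hypothetical workload: idle for 5 seconds, then a step up to 10 instances.
step_demand = [0] * 5 + [10] * 120
slow_boot = unmet_demand(step_demand, boot_seconds=60)
fast_boot = unmet_demand(step_demand, boot_seconds=10)
```

In this model the unserved demand after a step grows linearly with boot time, so avoiding it requires either predicting the step at least `boot_seconds` ahead or holding spare capacity.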
Cloud vs bare metal / other providers
- Some argue a single powerful bare‑metal box (or smaller clouds like Hetzner) can give much faster, simpler builds.
- Others stress that large clouds optimize for multi‑tenant isolation and networked storage, trading off raw latency for flexibility and durability.