2024-05-23

Making EC2 boot time faster

Why EC2 boot is slow (EBS, snapshots, hydration)

Main bottleneck: creating and “hydrating” an EBS root volume from a snapshot stored in S3.
First access to blocks is lazy and slow; later boots on the same volume are much faster.
EBS offers no true read‑only or copy‑on‑write root volumes, and EC2 cannot boot directly from local NVMe, so each instance pays the hydration cost for a full volume of its own.
Some report strongly bimodal boot times (tens vs ~80s); the cause is unclear but is likely related to capacity or scheduling.
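
Because first access to each block is lazy, one way to pay the cost up front is to read the whole volume once before it is needed. The following is a minimal Python sketch of that idea, not the method used in the article; the function name prewarm and the 1 MiB chunk size are assumptions chosen for illustration, and reading a real block device would normally require root.

```python
def prewarm(path: str, chunk_size: int = 1024 * 1024) -> int:
    """Read every byte of `path` once and return the number of bytes read.

    On a snapshot-backed volume, touching each block forces it to be
    fetched, so later reads of that block no longer pay first-access latency.
    """
    total = 0
    # buffering=0 opens an unbuffered raw file, so each read() call
    # goes straight to the operating system.
    with open(path, "rb", buffering=0) as dev:
        while True:
            chunk = dev.read(chunk_size)
            if not chunk:  # empty bytes means end of file/device
                break
            total += len(chunk)
    return total
```

A call such as prewarm("/dev/nvme1n1") would walk the whole device; that device path is hypothetical and depends on the instance. A single sequential reader like this is simple but is unlikely to match a tool that issues many reads in parallel.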

Workarounds and optimizations on AWS

Strategy in the article: boot once, stop the instance, then later start it from the already‑hydrated EBS volume.
Warm Pools exist but are reported as too slow to react (60s+ to notice scale-up) and have feature limits.
Minimizing image size and using sparse EBS volumes can reduce hydration time.
Fast Snapshot Restore, or pre‑hydrating a volume by reading all of its blocks with fio, helps but is expensive and still slower than desired.
Some note Amazon Linux 2023 boots faster than Ubuntu.
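
The boot‑once, stop, start‑later strategy can be sketched in Python against a boto3 EC2 client. This is an illustration under assumptions, not the article's code: the function names are made up, the client is passed in as a parameter, and a real setup would also wait for the guest OS to finish booting before stopping the instance.

```python
def prepare_prewarmed(ec2, image_id: str, instance_type: str) -> str:
    """Launch an instance, let it come up once, then stop it.

    `ec2` is expected to behave like boto3.client("ec2").
    Returns the ID of the stopped instance, whose root volume
    has already served the blocks read during that first boot.
    """
    resp = ec2.run_instances(
        ImageId=image_id,
        InstanceType=instance_type,
        MinCount=1,
        MaxCount=1,
    )
    instance_id = resp["Instances"][0]["InstanceId"]
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
    # Assumption: a real implementation would wait here until the OS
    # has fully booted, not just until the instance state is "running".
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])
    return instance_id


def start_prewarmed(ec2, instance_id: str) -> None:
    """Start a previously stopped instance and wait until it is running."""
    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
```

With boto3 installed, ec2 would be created with boto3.client("ec2"); the AMI ID and instance type are whatever the workload needs. A stopped instance still incurs EBS storage charges, which is the cost of keeping the volume hydrated.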

Alternative architectures: S3, immutable roots, ephemeral storage

Several propose tiny boot AMIs that fetch a squashfs/root image from S3 and run from RAM or overlayfs, using local NVMe for scratch.
s3fs is called out as unreliable in production; rclone mount is suggested instead.
Others argue this essentially re‑implements EBS/AMIs, while proponents say what’s missing is cheap read‑only shared roots with CoW overlays.
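
The read‑only root with a writable overlay that proponents describe maps onto two standard Linux mounts: a squashfs image mounted read‑only, and an overlayfs on top of it. The Python sketch below only builds the mount command lines; the helper names and every path are hypothetical, and fetching the image from S3 is left out.

```python
import shlex


def squashfs_mount_cmd(image: str, lower: str) -> list[str]:
    """Command to loop-mount a squashfs image read-only at `lower`."""
    return ["mount", "-t", "squashfs", "-o", "loop,ro", image, lower]


def overlay_mount_cmd(lower: str, upper: str, work: str, target: str) -> list[str]:
    """Command to stack a writable layer (`upper`) over a read-only `lower`.

    overlayfs requires `upper` and `work` to be on the same filesystem,
    for example a directory on local NVMe scratch space.
    """
    opts = f"lowerdir={lower},upperdir={upper},workdir={work}"
    return ["mount", "-t", "overlay", "overlay", "-o", opts, target]


if __name__ == "__main__":
    # Hypothetical paths, printed rather than executed.
    print(shlex.join(squashfs_mount_cmd("/run/root.squashfs", "/run/lower")))
    print(shlex.join(overlay_mount_cmd(
        "/run/lower", "/mnt/nvme/upper", "/mnt/nvme/work", "/newroot")))
```

Writes land in the upper directory and the squashfs image is never modified, so the same image can back many instances.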

CI, GitHub Actions, and latency sources

Even with fast EC2 boot, GitHub’s own runner startup and job assignment can add ~8–10s, partly negating gains.
Some vendors use scale, warm pools, or alternative stacks (Firecracker, unikernels) to get sub‑second to tens‑of‑ms cold starts for CI‑like workloads.

Autoscaling and prediction window

Shorter boot times shrink the prediction horizon and make autoscaling more accurate and cost‑efficient.
Debate on typical timescales: many workloads can live with 30–60s; others (spiky CI, interactive environments) need much faster reaction.
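
The link between boot time and prediction horizon can be made concrete with a deliberately simple model, which is an assumption for illustration and not something stated in the discussion: if demand grows linearly, an autoscaler must already have started enough instances to absorb the growth that happens during one boot.

```python
import math


def instances_to_prestart(
    growth_per_second: float,
    boot_seconds: float,
    capacity_per_instance: float,
) -> int:
    """Instances that must be started ahead of demand under linear growth.

    growth_per_second:     how fast demand rises (e.g. concurrent jobs per second)
    boot_seconds:          time from launch request to a usable instance
    capacity_per_instance: demand units one instance can serve
    """
    demand_during_boot = growth_per_second * boot_seconds
    return math.ceil(demand_during_boot / capacity_per_instance)


# Made-up numbers: demand rising by 0.5 jobs/s, 4 jobs per instance.
print(instances_to_prestart(0.5, 60, 4))  # 60 s boot -> 8
print(instances_to_prestart(0.5, 5, 4))   # 5 s boot  -> 1
```

Under these made-up numbers, cutting boot time from 60s to 5s reduces the speculative capacity from eight instances to one, which is the cost and accuracy argument in miniature.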

Cloud vs bare metal / other providers

Some argue a single powerful bare‑metal box (or smaller clouds like Hetzner) can give much faster, simpler builds.
Others stress that large clouds optimize for multi‑tenant isolation and networked storage, trading off raw latency for flexibility and durability.
