How AWS S3 serves 1 petabyte per second on top of slow HDDs
Additional resources & corrections to the article
- Multiple commenters point to the official “Building and operating a pretty big storage system called S3” post and a recent re:Invent talk as deeper, more authoritative sources.
- A technical reader notes that the article’s HDD seek-time figures (e.g., “8 ms full seek”) are off by a large margin; modern high-capacity HDDs take roughly 20–25 ms for a full-platter seek.
- Another highlights that the average seek is not simply half the full-platter distance (a uniform random model gives about one third), and that ZCAV and head acceleration complicate simple 1/2-or-1/3 models; see the sketch below.
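A minimal sketch of the seek model in question, assuming nothing about any particular drive: with head positions taken as uniformly random before and after a seek, the expected seek distance is one third of the full stroke, and seek time grows sub-linearly with distance because the head must accelerate and settle. The constants in `seek_time_ms` are purely illustrative.

```python
import random

# Monte Carlo estimate of the average seek *distance* under the simplest
# model: head position before and after a seek is independent and uniform
# across the platter. E[|x - y|] works out to 1/3 of the full stroke, not 1/2.
N = 1_000_000
avg_fraction = sum(abs(random.random() - random.random()) for _ in range(N)) / N
print(f"average seek distance ≈ {avg_fraction:.3f} of full stroke")  # ~0.333

# Seek *time* is not proportional to distance: short seeks are dominated by
# fixed settle overhead, and acceleration makes time grow roughly with the
# square root of distance. The constants below are illustrative only, not
# measurements of any real drive.
def seek_time_ms(d: float, settle: float = 2.0, a: float = 14.0) -> float:
    return settle + a * d ** 0.5

for d in (0.01, 0.1, 1 / 3, 1.0):
    print(f"d = {d:0.2f} of stroke -> ~{seek_time_ms(d):.1f} ms (toy model)")
```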
Open‑source and homelab analogues
- People ask whether any S3-compatible, HDD-optimized open-source systems approximate S3’s performance.
- Experiences reported with:
- Ceph+RadosGW (HDD for data, SSD for indexes/metadata; works well but EC tuning is complex, CephFS often underwhelming).
- GlusterFS (functional at scale but considered dated and not recommended for new deployments).
- SeaweedFS (now with RDMA and EC), Apache Ozone (100+ PB HDD clusters, SSD metadata), SwiftStack.
- Garage (simple S3-compatible store; uses replication only, no erasure coding by design).
- For a single big server (e.g., 80 HDDs plus a few NVMe drives), the advice is to use ZFS (often with SSD special vdevs for metadata) and to accept that most distributed object systems are designed for multi-node scale, not single-node performance.
How S3 is architected (from ex‑employees)
- The core “hot path” (GET/PUT/LIST) consists of synchronous web services, largely Java-based; historically a small number of main services, now hundreds of micro- and mid-sized services overall.
- Typical GET flow: front-end HTTP → index service (key → internal ID) → storage service (fetch data). Key prefix hashing is used to avoid hotspots; a minimal sketch follows this list.
- Internal RPC historically used a custom protocol (STUMPY); later replaced by another custom, more stream-oriented protocol.
- Lifecycle transitions (e.g., Standard → Glacier) involve many backend microservices and large batch jobs over trillions of objects; this creates visible daily load “humps” on internal metrics.
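As a rough illustration of the hotspot-avoidance point above, here is a minimal sketch of hashing a bucket/key to pick an index partition. The partition count, hash choice, and granularity are assumptions for illustration; S3’s actual routing and partitioning are not public.

```python
import hashlib

NUM_INDEX_PARTITIONS = 1024  # illustrative; the real partition count is not public

def index_partition(bucket: str, key: str) -> int:
    """Route a bucket/key to an index partition via hashing.

    Hashing (rather than raw lexicographic prefixes) keeps sequential key
    names such as 'logs/2024-01-01/...' from piling onto one partition and
    creating a hotspot. This sketches the general technique the commenters
    describe, not S3's actual scheme or its partition granularity.
    """
    digest = hashlib.sha256(f"{bucket}/{key}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_INDEX_PARTITIONS

# Typical GET, per the flow above:
#   front-end HTTP -> index partition (key -> internal ID) -> storage service
print(index_partition("my-bucket", "logs/2024-01-01/part-0000.gz"))
```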
HDD vs SSD and Glacier internals
- Consensus: main S3 storage is still mostly HDD, with SSDs for indexes/metadata and possibly caches. The new “Express One Zone” is presumed SSD-backed, though AWS is not explicit.
- Glacier’s physical backing (tape vs HDD vs other) remains unclear. Comments include insider-style claims (initially S3-based, later tape for some tiers) and a lot of explicit speculation; no definitive public confirmation.
Parallelism & erasure coding details
- Many summarize the scaling story as “parallelism”: shard objects across many disks and AZs, then read in parallel.
- Commenters stress that the non-trivial part is managing disk latency: random sharding and erasure coding allow reconstructing data from any k of n fragments, so reads can skip slow-seeking shards and still complete quickly (see the sketch after this list).
- There is debate over the exact S3 coding scheme. The article’s 5-of-9 example is criticized as unrealistic for cost and availability; commenters note that S3 likely uses multiple, more efficient (k, n) schemes, though the concrete parameters are not disclosed.
- Discussion explores how changing k and n trades off storage overhead (the ratio of physical to logical bytes, n/k), throughput from parallel reads, and availability under AZ failures and independent disk failures.
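A minimal sketch of the k-of-n read strategy described above, using the article’s illustrative 5-of-9 numbers (S3’s real parameters are not disclosed): all n fragment reads are issued in parallel and the object counts as readable as soon as any k arrive, so one slow-seeking disk does not set the latency.

```python
import concurrent.futures
import random
import time

K, N = 5, 9  # the article's illustrative 5-of-9 scheme; real parameters aren't public

def read_fragment(i: int) -> bytes:
    """Stand-in for fetching one erasure-coded fragment from one disk.

    Latency is randomized to mimic seek-time variance, with occasional
    slow outliers standing in for busy or slow-seeking drives.
    """
    delay = random.uniform(0.005, 0.020)
    if random.random() < 0.2:
        delay += random.uniform(0.05, 0.2)
    time.sleep(delay)
    return f"fragment-{i}".encode()

def read_object():
    """Issue all N fragment reads in parallel and stop once any K have arrived.

    The slowest N - K responses are ignored; decoding the K fragments back
    into the original object is omitted here.
    """
    start = time.perf_counter()
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=N)
    futures = [pool.submit(read_fragment, i) for i in range(N)]
    fragments = []
    for fut in concurrent.futures.as_completed(futures):
        fragments.append(fut.result())
        if len(fragments) >= K:
            break
    elapsed_ms = (time.perf_counter() - start) * 1000
    pool.shutdown(wait=False)  # don't block on the slow stragglers
    return fragments, elapsed_ms

fragments, elapsed_ms = read_object()
print(f"read {len(fragments)} of {N} fragments in {elapsed_ms:.1f} ms; "
      f"storage overhead = {N / K:.2f}x physical per logical byte")
```

With these numbers the overhead is 1.8× physical bytes per logical byte, which is one reason commenters consider the 5-of-9 example expensive; wider schemes lower the overhead but change the availability math.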
Ceph & EC tuning subtleties
- A Ceph discussion dives into:
- How RGW stripes S3 objects into RADOS objects (default 4 MB), and how EC then subdivides these; naive configs can create HDD-unfriendly small writes unless the stripe size is retuned (see the sketch after this list).
- CRUSH-based placement, balancing, and the danger that a single “fullest disk” can cap usable cluster capacity.
- Disagreement on practical safe utilization: some admins are comfortable at ~80–85% raw usage on large, well-balanced clusters; others report operational pain above ~70% on smaller or heterogeneous clusters.
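A back-of-envelope for the stripe-size point above, assuming the 4 MB RGW stripe mentioned in the discussion and an example 8+3 EC profile; the numbers are illustrations, not recommendations.

```python
# How RGW striping and Ceph EC interact, under the assumptions stated above.

MiB = 1024 * 1024

rgw_stripe_size = 4 * MiB   # size of the RADOS objects RGW writes (the cited default)
ec_k, ec_m = 8, 3           # example EC profile: k data chunks + m coding chunks

# Each RADOS object is split into k data chunks (plus m coding chunks),
# so the data landing on any single OSD -- and thus any single HDD -- is:
chunk = rgw_stripe_size / ec_k
print(f"per-OSD write: {chunk / 1024:.0f} KiB")  # 512 KiB with these numbers

# If you want per-OSD writes of at least ~1 MiB so HDDs stream rather than
# seek, you have to work backwards to a larger stripe:
target_chunk = 1 * MiB
print(f"stripe needed for 1 MiB chunks: {target_chunk * ec_k / MiB:.0f} MiB")  # 8 MiB
```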
Pricing, economics, and performance classes
- Several note that while HDD $/TB has fallen, S3 list prices have been flat for ~8+ years. Some argue competition is weak; others point out that inflation alone implies an effective price drop.
- Commenters emphasize that S3’s unit economics are dominated not just by storage but by per-request charges and IOPS/GB trade-offs. AWS can “waste” disk capacity (underfill drives) to deliver high IOPS/GB where customers pay enough in request fees; a back-of-envelope sketch follows.
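Every number below is an assumption chosen for illustration, not an AWS figure; the point is only that a big HDD runs out of IOPS long before it runs out of bytes.

```python
drive_capacity_tb = 20     # a modern high-capacity HDD
drive_random_iops = 150    # rough random-read IOPS such a drive can sustain

# If a storage class must deliver, say, 30 random reads per second per TB
# stored, IOPS becomes the binding constraint well before capacity does:
target_iops_per_tb = 30
usable_tb = drive_random_iops / target_iops_per_tb
fill = min(1.0, usable_tb / drive_capacity_tb)
print(f"only {usable_tb:.0f} TB of {drive_capacity_tb} TB usable ({fill:.0%}) "
      f"before IOPS becomes the bottleneck")

# The "wasted" bytes are paid for through per-request fees rather than $/GB,
# which is why request pricing matters as much as storage pricing.
```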
Scale, capacity, and “biggest storage on earth”
- Using “tens of millions of HDDs” as a back-of-envelope input, commenters infer S3 holds on the order of hundreds of exabytes, likely among the world’s largest single storage systems (back-of-envelope sketch below).
- Others speculate about very large government data centers as possible competitors, but also note that public numbers there are highly speculative.
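The arithmetic behind that inference, with every input an assumption taken from the discussion rather than a disclosed figure; only the order of magnitude matters.

```python
drives = 20_000_000      # "tens of millions of HDDs"
tb_per_drive = 16        # a blend of older and newer high-capacity drives
ec_overhead = 1.8        # physical/logical ratio from erasure coding (n/k)
fill_fraction = 0.7      # drives kept below full for IOPS headroom and growth

raw_eb = drives * tb_per_drive / 1_000_000          # 1 EB = 1,000,000 TB
logical_eb = raw_eb * fill_fraction / ec_overhead
print(f"raw ≈ {raw_eb:.0f} EB, logical ≈ {logical_eb:.0f} EB")
# -> roughly 320 EB raw and ~120 EB of customer data with these inputs
```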