How AWS S3 serves 1 petabyte per second on top of slow HDDs
Additional resources & corrections to the article
- Multiple commenters point to the official “Building and operating a pretty big storage system called S3” post and a recent re:Invent talk as deeper, more authoritative sources.
- A technical reader notes that the article’s HDD seek-time figures (e.g., “8 ms full seek”) are off by a large margin; modern high-capacity HDDs take roughly 20–25 ms for a full-platter seek.
- Another highlights that the average seek is not simply half the full-platter distance (a uniform random model gives about one third), and that ZCAV and head acceleration complicate simple 1/2-or-1/3 models; see the sketch below.
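A minimal sketch of the seek model in question, assuming nothing about any particular drive: with head positions taken as uniformly random before and after a seek, the expected seek distance is one third of the full stroke, and seek time grows sub-linearly with distance because the head must accelerate and settle. The constants in `seek_time_ms` are purely illustrative.

```python
import random

# Monte Carlo estimate of the average seek *distance* under the simplest
# model: head position before and after a seek is independent and uniform
# across the platter. E[|x - y|] works out to 1/3 of the full stroke, not 1/2.
N = 1_000_000
avg_fraction = sum(abs(random.random() - random.random()) for _ in range(N)) / N
print(f"average seek distance ≈ {avg_fraction:.3f} of full stroke")  # ~0.333

# Seek *time* is not proportional to distance: short seeks are dominated by
# fixed settle overhead, and acceleration makes time grow roughly with the
# square root of distance. The constants below are illustrative only, not
# measurements of any real drive.
def seek_time_ms(d: float, settle: float = 2.0, a: float = 14.0) -> float:
    return settle + a * d ** 0.5

for d in (0.01, 0.1, 1 / 3, 1.0):
    print(f"d = {d:0.2f} of stroke -> ~{seek_time_ms(d):.1f} ms (toy model)")
```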
Open‑source and homelab analogues
- People ask whether any S3-compatible, HDD-optimized open-source systems approximate S3’s performance.
- Experiences reported with:
- Ceph+RadosGW (HDD for data, SSD for indexes/metadata; works well but EC tuning is complex, CephFS often underwhelming).
- GlusterFS (functional at scale but considered dated and not recommended for new deployments).
- SeaweedFS (now with RDMA and EC), Apache Ozone (100+ PB HDD clusters, SSD metadata), SwiftStack.
- Garage (simple S3-compatible store; uses replication only, no erasure coding by design).
- For a single big server (e.g., 80 HDDs plus a few NVMe drives), the advice is to use ZFS (often with SSD special vdevs for metadata) and to accept that most distributed object systems are designed for multi-node scale, not single-node performance.
How S3 is architected (from ex‑employees)
- The core “hot path” (GET/PUT/LIST) consists of synchronous web services, largely Java-based; historically a small number of main services, now hundreds of micro- and mid-sized services overall.
- Typical GET flow: front-end HTTP → index service (key → internal ID) → storage service (fetch data). Key prefix hashing is used to avoid hotspots; a minimal sketch follows this list.
- Internal RPC historically used a custom protocol (STUMPY); later replaced by another custom, more stream-oriented protocol.
- Lifecycle transitions (e.g., Standard → Glacier) involve many backend microservices and large batch jobs over trillions of objects; this creates visible daily load “humps” on internal metrics.
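As a rough illustration of the hotspot-avoidance point above, here is a minimal sketch of hashing a bucket/key to pick an index partition. The partition count, hash choice, and granularity are assumptions for illustration; S3’s actual routing and partitioning are not public.

```python
import hashlib

NUM_INDEX_PARTITIONS = 1024  # illustrative; the real partition count is not public

def index_partition(bucket: str, key: str) -> int:
    """Route a bucket/key to an index partition via hashing.

    Hashing (rather than raw lexicographic prefixes) keeps sequential key
    names such as 'logs/2024-01-01/...' from piling onto one partition and
    creating a hotspot. This sketches the general technique the commenters
    describe, not S3's actual scheme or its partition granularity.
    """
    digest = hashlib.sha256(f"{bucket}/{key}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_INDEX_PARTITIONS

# Typical GET, per the flow above:
#   front-end HTTP -> index partition (key -> internal ID) -> storage service
print(index_partition("my-bucket", "logs/2024-01-01/part-0000.gz"))
```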
HDD vs SSD and Glacier internals
- Consensus: main S3 storage is still mostly HDD, with SSDs for indexes/metadata and possibly caches. The new “Express One Zone” is presumed SSD-backed, though AWS is not explicit.
- Glacier’s physical backing (tape vs HDD vs other) remains unclear. Comments include insider-style claims (initially S3-based, later tape for some tiers) and a lot of explicit speculation; no definitive public confirmation.
Parallelism & erasure coding details
- Many summarize the scaling story as “parallelism”: shard objects across many disks and AZs, then read in parallel.
- Commenters stress that the non-trivial part is managing disk latency: random sharding and erasure coding allow reconstructing data from any k of n fragments, so reads can skip slow-seeking shards and still complete quickly (see the sketch after this list).
- There is debate over the exact S3 coding scheme. The article’s 5-of-9 example is criticized as unrealistic for cost and availability; commenters note that S3 likely uses multiple, more efficient (k, n) schemes, though the concrete parameters are not disclosed.
- Discussion explores how changing k and n trades off storage overhead (the ratio of physical to logical bytes, n/k), throughput from parallel reads, and availability under AZ failures and independent disk failures.
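A minimal sketch of the k-of-n read strategy described above, using the article’s illustrative 5-of-9 numbers (S3’s real parameters are not disclosed): all n fragment reads are issued in parallel and the object counts as readable as soon as any k arrive, so one slow-seeking disk does not set the latency.

```python
import concurrent.futures
import random
import time

K, N = 5, 9  # the article's illustrative 5-of-9 scheme; real parameters aren't public

def read_fragment(i: int) -> bytes:
    """Stand-in for fetching one erasure-coded fragment from one disk.

    Latency is randomized to mimic seek-time variance, with occasional
    slow outliers standing in for busy or slow-seeking drives.
    """
    delay = random.uniform(0.005, 0.020)
    if random.random() < 0.2:
        delay += random.uniform(0.05, 0.2)
    time.sleep(delay)
    return f"fragment-{i}".encode()

def read_object():
    """Issue all N fragment reads in parallel and stop once any K have arrived.

    The slowest N - K responses are ignored; decoding the K fragments back
    into the original object is omitted here.
    """
    start = time.perf_counter()
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=N)
    futures = [pool.submit(read_fragment, i) for i in range(N)]
    fragments = []
    for fut in concurrent.futures.as_completed(futures):
        fragments.append(fut.result())
        if len(fragments) >= K:
            break
    elapsed_ms = (time.perf_counter() - start) * 1000
    pool.shutdown(wait=False)  # don't block on the slow stragglers
    return fragments, elapsed_ms

fragments, elapsed_ms = read_object()
print(f"read {len(fragments)} of {N} fragments in {elapsed_ms:.1f} ms; "
      f"storage overhead = {N / K:.2f}x physical per logical byte")
```

With these numbers the overhead is 1.8× physical bytes per logical byte, which is one reason commenters consider the 5-of-9 example expensive; wider schemes lower the overhead but change the availability math.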
Ceph & EC tuning subtleties
- A Ceph discussion dives into:
- How RGW stripes S3 objects into RADOS objects (default 4 MB), and how EC then subdivides these; naive configs can create HDD-unfriendly small writes unless the stripe size is retuned (see the sketch after this list).
- CRUSH-based placement, balancing, and the danger that a single “fullest disk” can cap usable cluster capacity.
- Disagreement on practical safe utilization: some admins are comfortable at ~80–85% raw usage on large, well-balanced clusters; others report operational pain above ~70% on smaller or heterogeneous clusters.
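A back-of-envelope for the stripe-size point above, assuming the 4 MB RGW stripe mentioned in the discussion and an example 8+3 EC profile; the numbers are illustrations, not recommendations.

```python
# How RGW striping and Ceph EC interact, under the assumptions stated above.

MiB = 1024 * 1024

rgw_stripe_size = 4 * MiB   # size of the RADOS objects RGW writes (the cited default)
ec_k, ec_m = 8, 3           # example EC profile: k data chunks + m coding chunks

# Each RADOS object is split into k data chunks (plus m coding chunks),
# so the data landing on any single OSD -- and thus any single HDD -- is:
chunk = rgw_stripe_size / ec_k
print(f"per-OSD write: {chunk / 1024:.0f} KiB")  # 512 KiB with these numbers

# If you want per-OSD writes of at least ~1 MiB so HDDs stream rather than
# seek, you have to work backwards to a larger stripe:
target_chunk = 1 * MiB
print(f"stripe needed for 1 MiB chunks: {target_chunk * ec_k / MiB:.0f} MiB")  # 8 MiB
```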
Pricing, economics, and performance classes
- Several note that while HDD $/TB has fallen, S3 list prices have been flat for ~8+ years. Some argue competition is weak; others point out that inflation alone implies an effective price drop.
- Commenters emphasize that S3’s unit economics are dominated not just by storage but by per-request charges and IOPS/GB trade-offs. AWS can “waste” disk capacity (underfill drives) to deliver high IOPS/GB where customers pay enough in request fees; a back-of-envelope sketch follows.
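Every number below is an assumption chosen for illustration, not an AWS figure; the point is only that a big HDD runs out of IOPS long before it runs out of bytes.

```python
drive_capacity_tb = 20     # a modern high-capacity HDD
drive_random_iops = 150    # rough random-read IOPS such a drive can sustain

# If a storage class must deliver, say, 30 random reads per second per TB
# stored, IOPS becomes the binding constraint well before capacity does:
target_iops_per_tb = 30
usable_tb = drive_random_iops / target_iops_per_tb
fill = min(1.0, usable_tb / drive_capacity_tb)
print(f"only {usable_tb:.0f} TB of {drive_capacity_tb} TB usable ({fill:.0%}) "
      f"before IOPS becomes the bottleneck")

# The "wasted" bytes are paid for through per-request fees rather than $/GB,
# which is why request pricing matters as much as storage pricing.
```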
Scale, capacity, and “biggest storage on earth”
- Using “tens of millions of HDDs” as a back-of-envelope input, commenters infer S3 holds on the order of hundreds of exabytes, likely among the world’s largest single storage systems (back-of-envelope sketch below).
- Others speculate about very large government data centers as possible competitors, but also note that public numbers there are highly speculative.
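The arithmetic behind that inference, with every input an assumption taken from the discussion rather than a disclosed figure; only the order of magnitude matters.

```python
drives = 20_000_000      # "tens of millions of HDDs"
tb_per_drive = 16        # a blend of older and newer high-capacity drives
ec_overhead = 1.8        # physical/logical ratio from erasure coding (n/k)
fill_fraction = 0.7      # drives kept below full for IOPS headroom and growth

raw_eb = drives * tb_per_drive / 1_000_000          # 1 EB = 1,000,000 TB
logical_eb = raw_eb * fill_fraction / ec_overhead
print(f"raw ≈ {raw_eb:.0f} EB, logical ≈ {logical_eb:.0f} EB")
# -> roughly 320 EB raw and ~120 EB of customer data with these inputs
```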