I/O is no longer the bottleneck? (2022)
Thread context & meta
- This post is a rebuttal to an earlier article with the same title, “I/O is no longer the bottleneck”; the shared title caused some confusion in the thread over which of the two posts was being discussed.
- The author has a part 2 and a later addendum; commenters generally find the tricks and analysis there interesting.
What is actually the bottleneck?
- Several comments stress that bottlenecks are workload‑dependent: CPU, memory bandwidth, cache, disk, network, locks, or downstream services can all dominate.
- A recurring theme: the “memory wall” – CPUs have grown faster than memory, so many workloads are limited by memory bandwidth or latency rather than compute.
- One commenter reframes the issue as latency vs throughput: serial, small, scattered operations (DB queries, tiny disk reads, unbatched RPCs) leave the system idle waiting on round trips even when raw bandwidth is plentiful (see the back‑of‑envelope sketch just below).
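To make the reframing concrete, here is a minimal back‑of‑envelope sketch; the round‑trip latency, payload size, and bandwidth figures are assumed purely for illustration, not taken from the thread:

```c
/* Back-of-envelope illustration (all numbers assumed): why serial, small,
 * unbatched operations are dominated by latency (waiting) rather than by
 * bandwidth (moving bytes). Compile with: cc -O2 -o latency latency.c */
#include <stdio.h>

int main(void) {
    const double rtt_s        = 0.5e-3; /* 0.5 ms per round trip (assumed)   */
    const long   n_ops        = 10000;  /* small, serial, unbatched requests */
    const double bytes_per_op = 100.0;  /* tiny payloads (assumed)           */
    const double bandwidth    = 1e9;    /* 1 GB/s link or bus (assumed)      */

    double wait_time     = n_ops * rtt_s;                    /* time spent idle, waiting   */
    double transfer_time = n_ops * bytes_per_op / bandwidth; /* time actually moving bytes */

    printf("serial waiting: %.3f s\n", wait_time);
    printf("bulk transfer:  %.6f s\n", transfer_time);
    printf("latency dominates by a factor of ~%.0fx\n", wait_time / transfer_time);
    return 0;
}
```

With these made‑up numbers the idle waiting is ~5 s while moving the same bytes in bulk takes ~1 ms, which is the sense in which “I/O” can still feel slow even when raw bandwidth is nowhere near the limit.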
Per‑core memory bandwidth debate
- One claim: typical x86 cores top out around ~6 GB/s memcpy, Apple M‑series around ~20 GB/s; this is used to argue parsers can’t exceed those per‑core limits.
- Multiple others strongly dispute these numbers, providing microbenchmark data showing 9–35 GB/s per x86 core and up to ~100+ GB/s on recent Apple chips (with non‑temporal/vectorized copies and “warm” memory); a minimal benchmark of this kind is sketched after this list.
- Discussion of architectural limits: finite numbers of outstanding cacheline fills (LFB/MSHR entries), DRAM vs SRAM characteristics, motherboard wiring limits, and how in‑package memory (Apple, some Ryzen variants) raises effective bandwidth.
- Some note you often need the iGPU or multiple cores to actually saturate memory channels.
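For anyone who wants to reproduce the numbers being argued about, here is a minimal sketch of the kind of single‑core measurement involved. It uses plain memcpy only (no non‑temporal or hand‑vectorized copies), so results should land toward the lower end of the figures above and will vary with buffer size, compiler, and libc:

```c
/* Rough single-core memcpy bandwidth microbenchmark (a sketch, not a rigorous
 * benchmark): copies a buffer much larger than L3 cache and reports GiB/s.
 * Compile with: cc -O2 -o membw membw.c */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE (256UL * 1024 * 1024)  /* 256 MiB, well beyond typical L3 */
#define ITERS    20

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    char *src = malloc(BUF_SIZE);
    char *dst = malloc(BUF_SIZE);
    if (!src || !dst) { perror("malloc"); return 1; }

    /* Fault in and "warm" both buffers so we time copies, not page faults. */
    memset(src, 1, BUF_SIZE);
    memset(dst, 2, BUF_SIZE);

    double start = now_sec();
    for (int i = 0; i < ITERS; i++)
        memcpy(dst, src, BUF_SIZE);
    double elapsed = now_sec() - start;

    /* Touch dst so the compiler cannot treat the copies as dead code. */
    printf("(checksum byte: %d)\n", dst[0]);

    double gib = (double)BUF_SIZE * ITERS / (1024.0 * 1024.0 * 1024.0);
    printf("plain memcpy: %.2f GiB/s copied (%.2f GiB/s read+write bus traffic)\n",
           gib / elapsed, 2.0 * gib / elapsed);

    free(src);
    free(dst);
    return 0;
}
```

Each copy both reads and writes the buffer, so the traffic the memory system sees is roughly twice the “copied” figure; whether one counts copied bytes or total read+write traffic is one reason quoted numbers differ.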
SSD/NVMe and I/O characteristics
- Modern NVMe sequential reads (10–14 GB/s) can exceed what a single core can process, but:
  - Peak numbers are short bursts; sustained real‑world throughput is lower, especially with random access (a simple way to check sustained sequential throughput is sketched after this list).
  - DMA allows SSDs to move data without consuming CPU cycles, shifting the bottleneck back to memory bandwidth and higher‑level processing.
- Debate over whether multiple cores are needed to saturate SSDs; consensus is that IOPS patterns (many small writes vs larger ones) matter more than raw bandwidth.
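As a rough way to check sustained (rather than burst) sequential read throughput from one thread, the sketch below simply times large read() calls over a file. To measure the device rather than the page cache, the file needs to be larger than RAM (or caches dropped first), and peak NVMe numbers additionally depend on queue depth, block size, and direct I/O, none of which this sketch exercises:

```c
/* Rough sustained sequential-read throughput check from a single thread
 * (a sketch: buffered I/O, one outstanding read at a time).
 * Compile with: cc -O2 -o seqread seqread.c, then run it on a large file. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define CHUNK (4UL * 1024 * 1024)  /* 4 MiB per read() call */

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(int argc, char **argv) {
    if (argc != 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    char *buf = malloc(CHUNK);
    if (!buf) { perror("malloc"); return 1; }

    unsigned long long total = 0;
    double start = now_sec();

    for (;;) {
        ssize_t n = read(fd, buf, CHUNK);
        if (n < 0) { perror("read"); return 1; }
        if (n == 0) break;  /* end of file */
        total += (unsigned long long)n;
    }

    double elapsed = now_sec() - start;
    printf("%llu bytes in %.2f s = %.2f GB/s sustained\n",
           total, elapsed, total / elapsed / 1e9);

    free(buf);
    close(fd);
    return 0;
}
```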
Serialization, zero‑copy formats, and parsing
- One line of argument: formats like JSON/Protobuf require full parsing before accessing fields, so they’re constrained by per‑core scan bandwidth.
- Zero‑copy, indexed formats can “skip” large parts of messages (only touching the needed cachelines), effectively delivering higher useful throughput per core; a toy illustration of the idea follows this list.
- A new format (Lite³) is discussed:
  - Schemaless, fully indexed, allows in‑place mutation; trades message size for flexibility.
  - Some see schemaless as great ergonomically; others argue that in practice you always have a schema and want the size benefits of encoding it.
  - Questions around fragmentation, vacuuming, and how variable‑length fields are updated in place.
  - Comparisons and references to Cap’n Proto, Flatbuffers, rkyv, and another schema‑based format (STEF).
- Skeptics note that “outperforming” parsers is easier when data is effectively pre‑parsed and memory‑mapped; analogy debates (tanker truck vs beer cans) explore what counts as “real” work.
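As a toy sketch of the offset‑indexed, zero‑copy idea: field offsets live in a small header, so reading one field touches only that field's bytes instead of scanning the whole message the way a JSON or Protobuf parser must. The layout below is invented for illustration and is not the actual encoding of Lite³, Cap’n Proto, or Flatbuffers:

```c
/* Toy offset-indexed, zero-copy message layout (invented for illustration).
 * A small header of offsets/lengths is followed by the field payloads, so a
 * reader can jump straight to one field without parsing the rest.
 * Compile with: cc -O2 -o zerocopy zerocopy.c */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

enum { F_USER_ID, F_NAME, F_BIO, NUM_FIELDS };

/* Native-endian for simplicity; a real wire format would pin endianness. */
struct msg_header {
    uint32_t offset[NUM_FIELDS];  /* byte offset of each field's payload */
    uint32_t length[NUM_FIELDS];  /* byte length of each field's payload */
};

/* Zero-copy field access: header lookup plus pointer arithmetic, no parsing
 * and no copying; only the header and the requested bytes are touched. */
static const char *get_field(const uint8_t *buf, int field, uint32_t *len) {
    const struct msg_header *h = (const struct msg_header *)buf;  /* buf is aligned below */
    *len = h->length[field];
    return (const char *)buf + h->offset[field];
}

int main(void) {
    _Alignas(4) uint8_t buf[256];  /* flat message buffer */
    struct msg_header h;
    const char *payload[NUM_FIELDS] = { "42", "alice", "a very long bio..." };

    /* Writer: lay out header + payloads back to back. */
    uint32_t pos = sizeof h;
    for (int i = 0; i < NUM_FIELDS; i++) {
        h.offset[i] = pos;
        h.length[i] = (uint32_t)strlen(payload[i]);
        memcpy(buf + pos, payload[i], h.length[i]);
        pos += h.length[i];
    }
    memcpy(buf, &h, sizeof h);

    /* Reader: fetch just the name; user_id and bio are never examined. */
    uint32_t len;
    const char *name = get_field(buf, F_NAME, &len);
    printf("name = %.*s\n", (int)len, name);
    return 0;
}
```

The trade‑offs the thread raises (larger messages, in‑place updates of variable‑length fields, fragmentation and vacuuming) are exactly the parts this toy version ignores and a real format has to solve.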
Future architectures & unified memory/I/O
- Speculation that CXL/PCIe and AI‑driven investment may push architectures toward a mesh of CPUs, RAM, storage, GPUs in one unified virtual address space.
- Others point out that today’s systems already map devices into a common address space via PCIe and mmap, but practical concerns (filesystems, sharing between processes) keep higher‑level abstractions in place.
- Some wishlist ideas: mmap with malloc‑like ergonomics, trivial “make this buffer persistent” APIs, and SSDs treated more like extended RAM (see the mmap sketch below).
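As a sketch of how far today's primitives already go toward “make this buffer persistent”: back a region with a file via mmap, mutate it in place, and msync when the contents should reach storage. The file name and region size below are arbitrary, and crash consistency, write ordering, and cross‑process sharing are precisely the hard parts this leaves out:

```c
/* Sketch: a file-backed, persistently mapped buffer that survives process
 * restarts. Writes go straight to mapped memory; msync pushes dirty pages to
 * the backing file. Compile with: cc -O2 -o persist persist.c */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define REGION_SIZE 4096  /* one page, arbitrary */

int main(void) {
    int fd = open("state.bin", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (ftruncate(fd, REGION_SIZE) < 0) { perror("ftruncate"); return 1; }

    /* Map the file; stores through this pointer dirty file-backed pages. */
    char *buf = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    printf("previous contents: %.64s\n", buf);  /* survives process restarts */
    snprintf(buf, REGION_SIZE, "last updated by pid %d", (int)getpid());

    /* Ask the kernel to write dirty pages back before treating the update as
     * durable; without this (or fsync) a power loss can still drop it. */
    if (msync(buf, REGION_SIZE, MS_SYNC) < 0) { perror("msync"); return 1; }

    munmap(buf, REGION_SIZE);
    close(fd);
    return 0;
}
```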
fsync, persistence, and mmap
- Even with fast NVMe, fsync remains slow, and it is still what buys true durability, especially for databases (the basic write‑then‑fsync pattern is sketched after this list).
- One view: many applications could relax durability (rely on backups) and benefit from mmap‑style persistence, where process crashes don’t lose data.
- Questions about fsync semantics (waiting on all operations vs relevant ranges) and whether NVMe controllers sometimes lie about flush completion.
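For reference, a minimal sketch of the write‑then‑fsync pattern whose cost is being discussed; the file names are arbitrary, and on Linux the parent directory also needs an fsync when files are created or renamed into place:

```c
/* Sketch of the durability pattern databases pay for: append a record, fsync
 * the file, and fsync the parent directory so the new entry itself is durable.
 * The fsync calls, not the writes, are typically the slow part even on NVMe.
 * Compile with: cc -O2 -o durable durable.c */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    const char *record = "committed transaction 123\n";

    int fd = open("wal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }

    if (write(fd, record, strlen(record)) < 0) { perror("write"); return 1; }

    /* Block until data and metadata reach stable storage -- assuming the
     * device honors flushes, which the thread notes is not a given. */
    if (fsync(fd) < 0) { perror("fsync"); return 1; }
    close(fd);

    /* If wal.log was just created or renamed, its directory entry must also
     * be made durable, hence the fsync on the containing directory. */
    int dirfd = open(".", O_RDONLY);
    if (dirfd < 0) { perror("open dir"); return 1; }
    if (fsync(dirfd) < 0) { perror("fsync dir"); return 1; }
    close(dirfd);

    return 0;
}
```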
Real‑world performance anecdotes
- An OLAP database optimizer reports memory bandwidth, not disk, as the bottleneck under high concurrency.
- Others note that in cloud VMs/containers, storage I/O is still a very real bottleneck; managed/cloud setups often deliver far less performance per dollar than local hardware.
Software latency and bloat vs hardware gains
- Several commenters observe that despite enormous hardware advances, many everyday apps (messengers, the Windows UI, etc.) feel slower than their predecessors did on far weaker hardware.
- Proposed causes:
  - Blocking I/O on the GUI thread and insufficient attention to latency.
  - Software bloat (heavy frameworks/Electron apps) keeping CPUs busy doing non‑essential work.
- Historical examples (e.g., C64 word processors) show carefully staged UI work for responsiveness; people lament that such disciplined engineering is rarer now.
General takeaway
- No single universal bottleneck: modern systems are a balance of CPU, memory hierarchy, storage, and concurrency.
- The consensus advice: measure actual workloads (profiles, traces), identify what saturates first, change one thing, and measure again rather than relying on slogans like “I/O is no longer the bottleneck.”