I/O is no longer the bottleneck? (2022)
Thread context & meta
- This post is a rebuttal to an earlier article with the same title, “I/O is no longer the bottleneck”; the shared title caused some confusion in the thread over which of the two posts was being discussed.
- The author has a part 2 and a later addendum; commenters generally find the tricks and analysis there interesting.
What is actually the bottleneck?
- Several comments stress that bottlenecks are workload‑dependent: CPU, memory bandwidth, cache, disk, network, locks, or downstream services can all dominate.
- A recurring theme: the “memory wall” – CPUs have grown faster than memory, so many workloads are limited by memory bandwidth or latency rather than compute.
- One commenter reframes the issue as latency vs throughput: serial, small, scattered operations (DB queries, tiny disk reads, unbatched RPCs) leave the system idle waiting on round trips even when raw bandwidth is plentiful (see the back‑of‑envelope sketch just below).
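To make the reframing concrete, here is a minimal back‑of‑envelope sketch; the round‑trip latency, payload size, and bandwidth figures are assumed purely for illustration, not taken from the thread:

```c
/* Back-of-envelope illustration (all numbers assumed): why serial, small,
 * unbatched operations are dominated by latency (waiting) rather than by
 * bandwidth (moving bytes). Compile with: cc -O2 -o latency latency.c */
#include <stdio.h>

int main(void) {
    const double rtt_s        = 0.5e-3; /* 0.5 ms per round trip (assumed)   */
    const long   n_ops        = 10000;  /* small, serial, unbatched requests */
    const double bytes_per_op = 100.0;  /* tiny payloads (assumed)           */
    const double bandwidth    = 1e9;    /* 1 GB/s link or bus (assumed)      */

    double wait_time     = n_ops * rtt_s;                    /* time spent idle, waiting   */
    double transfer_time = n_ops * bytes_per_op / bandwidth; /* time actually moving bytes */

    printf("serial waiting: %.3f s\n", wait_time);
    printf("bulk transfer:  %.6f s\n", transfer_time);
    printf("latency dominates by a factor of ~%.0fx\n", wait_time / transfer_time);
    return 0;
}
```

With these made‑up numbers the idle waiting is ~5 s while moving the same bytes in bulk takes ~1 ms, which is the sense in which “I/O” can still feel slow even when raw bandwidth is nowhere near the limit.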
Per‑core memory bandwidth debate
- One claim: typical x86 cores top out around ~6 GB/s memcpy, Apple M‑series around ~20 GB/s; this is used to argue parsers can’t exceed those per‑core limits.
- Multiple others strongly dispute these numbers, providing microbenchmark data showing 9–35 GB/s per x86 core and up to ~100+ GB/s on recent Apple chips (with non‑temporal/vectorized copies and “warm” memory); a minimal benchmark of this kind is sketched after this list.
- Discussion of architectural limits: finite numbers of outstanding cacheline fills (LFB/MSHR entries), DRAM vs SRAM characteristics, motherboard wiring limits, and how in‑package memory (Apple, some Ryzen variants) raises effective bandwidth.
- Some note you often need the iGPU or multiple cores to actually saturate memory channels.
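For anyone who wants to reproduce the numbers being argued about, here is a minimal sketch of the kind of single‑core measurement involved. It uses plain memcpy only (no non‑temporal or hand‑vectorized copies), so results should land toward the lower end of the figures above and will vary with buffer size, compiler, and libc:

```c
/* Rough single-core memcpy bandwidth microbenchmark (a sketch, not a rigorous
 * benchmark): copies a buffer much larger than L3 cache and reports GiB/s.
 * Compile with: cc -O2 -o membw membw.c */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE (256UL * 1024 * 1024)  /* 256 MiB, well beyond typical L3 */
#define ITERS    20

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    char *src = malloc(BUF_SIZE);
    char *dst = malloc(BUF_SIZE);
    if (!src || !dst) { perror("malloc"); return 1; }

    /* Fault in and "warm" both buffers so we time copies, not page faults. */
    memset(src, 1, BUF_SIZE);
    memset(dst, 2, BUF_SIZE);

    double start = now_sec();
    for (int i = 0; i < ITERS; i++)
        memcpy(dst, src, BUF_SIZE);
    double elapsed = now_sec() - start;

    /* Touch dst so the compiler cannot treat the copies as dead code. */
    printf("(checksum byte: %d)\n", dst[0]);

    double gib = (double)BUF_SIZE * ITERS / (1024.0 * 1024.0 * 1024.0);
    printf("plain memcpy: %.2f GiB/s copied (%.2f GiB/s read+write bus traffic)\n",
           gib / elapsed, 2.0 * gib / elapsed);

    free(src);
    free(dst);
    return 0;
}
```

Each copy both reads and writes the buffer, so the traffic the memory system sees is roughly twice the “copied” figure; whether one counts copied bytes or total read+write traffic is one reason quoted numbers differ.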
SSD/NVMe and I/O characteristics
- Modern NVMe sequential reads (10–14 GB/s) can exceed what a single core can process, but:
  - Peak numbers are short bursts; sustained real‑world throughput is lower, especially with random access (a simple way to check sustained sequential throughput is sketched after this list).
  - DMA allows SSDs to move data without consuming CPU cycles, shifting the bottleneck back to memory bandwidth and higher‑level processing.
- Debate over whether multiple cores are needed to saturate SSDs; consensus is that IOPS patterns (many small writes vs larger ones) matter more than raw bandwidth.
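As a rough way to check sustained (rather than burst) sequential read throughput from one thread, the sketch below simply times large read() calls over a file. To measure the device rather than the page cache, the file needs to be larger than RAM (or caches dropped first), and peak NVMe numbers additionally depend on queue depth, block size, and direct I/O, none of which this sketch exercises:

```c
/* Rough sustained sequential-read throughput check from a single thread
 * (a sketch: buffered I/O, one outstanding read at a time).
 * Compile with: cc -O2 -o seqread seqread.c, then run it on a large file. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define CHUNK (4UL * 1024 * 1024)  /* 4 MiB per read() call */

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(int argc, char **argv) {
    if (argc != 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    char *buf = malloc(CHUNK);
    if (!buf) { perror("malloc"); return 1; }

    unsigned long long total = 0;
    double start = now_sec();

    for (;;) {
        ssize_t n = read(fd, buf, CHUNK);
        if (n < 0) { perror("read"); return 1; }
        if (n == 0) break;  /* end of file */
        total += (unsigned long long)n;
    }

    double elapsed = now_sec() - start;
    printf("%llu bytes in %.2f s = %.2f GB/s sustained\n",
           total, elapsed, total / elapsed / 1e9);

    free(buf);
    close(fd);
    return 0;
}
```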
Serialization, zero‑copy formats, and parsing
- One line of argument: formats like JSON/Protobuf require full parsing before accessing fields, so they’re constrained by per‑core scan bandwidth.
- Zero‑copy, indexed formats can “skip” large parts of messages (only touching the needed cachelines), effectively delivering higher useful throughput per core; a toy illustration of the idea follows this list.
- A new format (Lite³) is discussed:
  - Schemaless, fully indexed, allows in‑place mutation; trades message size for flexibility.
  - Some see schemaless as great ergonomically; others argue that in practice you always have a schema and want the size benefits of encoding it.
  - Questions around fragmentation, vacuuming, and how variable‑length fields are updated in place.
  - Comparisons and references to Cap’n Proto, Flatbuffers, rkyv, and another schema‑based format (STEF).
- Skeptics note that “outperforming” parsers is easier when data is effectively pre‑parsed and memory‑mapped; analogy debates (tanker truck vs beer cans) explore what counts as “real” work.
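As a toy sketch of the offset‑indexed, zero‑copy idea: field offsets live in a small header, so reading one field touches only that field's bytes instead of scanning the whole message the way a JSON or Protobuf parser must. The layout below is invented for illustration and is not the actual encoding of Lite³, Cap’n Proto, or Flatbuffers:

```c
/* Toy offset-indexed, zero-copy message layout (invented for illustration).
 * A small header of offsets/lengths is followed by the field payloads, so a
 * reader can jump straight to one field without parsing the rest.
 * Compile with: cc -O2 -o zerocopy zerocopy.c */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

enum { F_USER_ID, F_NAME, F_BIO, NUM_FIELDS };

/* Native-endian for simplicity; a real wire format would pin endianness. */
struct msg_header {
    uint32_t offset[NUM_FIELDS];  /* byte offset of each field's payload */
    uint32_t length[NUM_FIELDS];  /* byte length of each field's payload */
};

/* Zero-copy field access: header lookup plus pointer arithmetic, no parsing
 * and no copying; only the header and the requested bytes are touched. */
static const char *get_field(const uint8_t *buf, int field, uint32_t *len) {
    const struct msg_header *h = (const struct msg_header *)buf;  /* buf is aligned below */
    *len = h->length[field];
    return (const char *)buf + h->offset[field];
}

int main(void) {
    _Alignas(4) uint8_t buf[256];  /* flat message buffer */
    struct msg_header h;
    const char *payload[NUM_FIELDS] = { "42", "alice", "a very long bio..." };

    /* Writer: lay out header + payloads back to back. */
    uint32_t pos = sizeof h;
    for (int i = 0; i < NUM_FIELDS; i++) {
        h.offset[i] = pos;
        h.length[i] = (uint32_t)strlen(payload[i]);
        memcpy(buf + pos, payload[i], h.length[i]);
        pos += h.length[i];
    }
    memcpy(buf, &h, sizeof h);

    /* Reader: fetch just the name; user_id and bio are never examined. */
    uint32_t len;
    const char *name = get_field(buf, F_NAME, &len);
    printf("name = %.*s\n", (int)len, name);
    return 0;
}
```

The trade‑offs the thread raises (larger messages, in‑place updates of variable‑length fields, fragmentation and vacuuming) are exactly the parts this toy version ignores and a real format has to solve.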
Future architectures & unified memory/I/O
- Speculation that CXL/PCIe and AI‑driven investment may push architectures toward a mesh of CPUs, RAM, storage, GPUs in one unified virtual address space.
- Others point out that today’s systems already map devices into a common address space via PCIe and mmap, but practical concerns (filesystems, sharing between processes) keep higher‑level abstractions in place.
- Some wishlist ideas: mmap with malloc‑like ergonomics, trivial “make this buffer persistent” APIs, and SSDs treated more like extended RAM (see the mmap sketch below).
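As a sketch of how far today's primitives already go toward “make this buffer persistent”: back a region with a file via mmap, mutate it in place, and msync when the contents should reach storage. The file name and region size below are arbitrary, and crash consistency, write ordering, and cross‑process sharing are precisely the hard parts this leaves out:

```c
/* Sketch: a file-backed, persistently mapped buffer that survives process
 * restarts. Writes go straight to mapped memory; msync pushes dirty pages to
 * the backing file. Compile with: cc -O2 -o persist persist.c */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define REGION_SIZE 4096  /* one page, arbitrary */

int main(void) {
    int fd = open("state.bin", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (ftruncate(fd, REGION_SIZE) < 0) { perror("ftruncate"); return 1; }

    /* Map the file; stores through this pointer dirty file-backed pages. */
    char *buf = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    printf("previous contents: %.64s\n", buf);  /* survives process restarts */
    snprintf(buf, REGION_SIZE, "last updated by pid %d", (int)getpid());

    /* Ask the kernel to write dirty pages back before treating the update as
     * durable; without this (or fsync) a power loss can still drop it. */
    if (msync(buf, REGION_SIZE, MS_SYNC) < 0) { perror("msync"); return 1; }

    munmap(buf, REGION_SIZE);
    close(fd);
    return 0;
}
```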
fsync, persistence, and mmap
- Even with fast NVMe, fsync remains slow, and it is still what buys true durability, especially for databases (the basic write‑then‑fsync pattern is sketched after this list).
- One view: many applications could relax durability (rely on backups) and benefit from mmap‑style persistence, where process crashes don’t lose data.
- Questions about fsync semantics (waiting on all operations vs relevant ranges) and whether NVMe controllers sometimes lie about flush completion.
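For reference, a minimal sketch of the write‑then‑fsync pattern whose cost is being discussed; the file names are arbitrary, and on Linux the parent directory also needs an fsync when files are created or renamed into place:

```c
/* Sketch of the durability pattern databases pay for: append a record, fsync
 * the file, and fsync the parent directory so the new entry itself is durable.
 * The fsync calls, not the writes, are typically the slow part even on NVMe.
 * Compile with: cc -O2 -o durable durable.c */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    const char *record = "committed transaction 123\n";

    int fd = open("wal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }

    if (write(fd, record, strlen(record)) < 0) { perror("write"); return 1; }

    /* Block until data and metadata reach stable storage -- assuming the
     * device honors flushes, which the thread notes is not a given. */
    if (fsync(fd) < 0) { perror("fsync"); return 1; }
    close(fd);

    /* If wal.log was just created or renamed, its directory entry must also
     * be made durable, hence the fsync on the containing directory. */
    int dirfd = open(".", O_RDONLY);
    if (dirfd < 0) { perror("open dir"); return 1; }
    if (fsync(dirfd) < 0) { perror("fsync dir"); return 1; }
    close(dirfd);

    return 0;
}
```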
Real‑world performance anecdotes
- An OLAP database optimizer reports memory bandwidth, not disk, as the bottleneck under high concurrency.
- Others note that in cloud VMs/containers, storage I/O is still a very real bottleneck; managed/cloud setups often deliver far less performance per dollar than local hardware.
Software latency and bloat vs hardware gains
- Several commenters observe that despite enormous hardware advances, many everyday apps (messengers, the Windows UI, etc.) feel slower than their predecessors did on far weaker hardware.
- Proposed causes:
  - Blocking I/O on the GUI thread and insufficient attention to latency.
  - Software bloat (heavy frameworks/Electron apps) keeping CPUs busy doing non‑essential work.
- Historical examples (e.g., C64 word processors) show carefully staged UI work for responsiveness; people lament that such disciplined engineering is rarer now.
General takeaway
- No single universal bottleneck: modern systems are a balance of CPU, memory hierarchy, storage, and concurrency.
- The consensus advice: measure actual workloads (profiles, traces), identify what saturates first, change one thing, and measure again rather than relying on slogans like “I/O is no longer the bottleneck.”