I/O is no longer the bottleneck? (2022)

Thread context & meta

  • This post is a rebuttal to an earlier article of the same title, which caused some confusion in the thread about which post is which.
  • The author has also published a part 2 and a later addendum; commenters generally find the tricks and analysis there interesting.

What is actually the bottleneck?

  • Several comments stress that bottlenecks are workload‑dependent: CPU, memory bandwidth, cache, disk, network, locks, or downstream services can all dominate.
  • A recurring theme is the “memory wall”: CPU performance has improved far faster than memory performance, so many workloads are limited by memory bandwidth or latency rather than compute.
  • One commenter reframes the issue as latency vs throughput: serial, small, scattered operations (DB queries, tiny disk reads, unbatched RPCs) cause idle waiting even when raw bandwidth is high.
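
One way to make that latency-vs-throughput reframing concrete is a back-of-the-envelope model. The sketch below uses assumed numbers (0.5 ms round-trip, a 1 GB/s link, 200-byte payloads), not measurements, to show why 10,000 serial operations take seconds even though the bytes involved would take milliseconds to move:

```c
/* Back-of-the-envelope model of the latency-vs-throughput point:
 * serial round-trips are dominated by latency, not bandwidth.
 * All numbers are illustrative assumptions, not measurements. */
#include <stdio.h>

int main(void) {
    double rtt_s      = 0.5e-3;   /* assumed 0.5 ms per DB query / small RPC */
    double bandwidth  = 1e9;      /* assumed 1 GB/s link */
    double bytes_each = 200.0;    /* assumed tiny payload per operation */
    int    n_ops      = 10000;

    double serial_s  = n_ops * (rtt_s + bytes_each / bandwidth);
    double batched_s = rtt_s + (n_ops * bytes_each) / bandwidth;

    printf("serial:  %.3f s\n", serial_s);   /* ~5 s, almost all idle waiting */
    printf("batched: %.4f s\n", batched_s);  /* ~0.0025 s, one round-trip */
    return 0;
}
```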

Per‑core memory bandwidth debate

  • One claim: typical x86 cores top out around 6 GB/s of memcpy throughput and Apple M‑series cores around 20 GB/s; this is used to argue parsers can’t exceed those per‑core limits.
  • Multiple others strongly dispute these numbers, citing microbenchmarks showing 9–35 GB/s per x86 core and over 100 GB/s on recent Apple chips (with non‑temporal/vectorized copies and “warm” memory); a sketch of such a benchmark follows this list.
  • Discussion of architectural limits: finite numbers of outstanding cacheline fills (LFB/MSHR entries), DRAM vs SRAM characteristics, motherboard wiring limits, and how in‑package memory (Apple, some Ryzen variants) raises effective bandwidth.
  • Some note you often need the iGPU or multiple cores to actually saturate memory channels.
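
A minimal single-core memcpy probe like the one below is roughly what commenters were trading numbers from. The buffer size, warm-up, and repetition count are assumptions, and together with compiler flags they materially change the result, which is much of why the quoted figures diverge:

```c
/* Minimal single-core memcpy bandwidth probe. Buffer size is chosen to
 * stream from DRAM rather than cache; pages are touched first so the loop
 * measures copies, not page faults. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define SIZE ((size_t)256 << 20)  /* 256 MiB, well beyond any cache */

int main(void) {
    char *src = malloc(SIZE), *dst = malloc(SIZE);
    if (!src || !dst) return 1;
    memset(src, 1, SIZE);  /* fault pages in ("warm" memory) */
    memset(dst, 1, SIZE);

    struct timespec t0, t1;
    int reps = 8;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < reps; i++)
        memcpy(dst, src, SIZE);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.2f GB/s memcpy\n", reps * (double)SIZE / secs / 1e9);
    free(src); free(dst);
    return 0;
}
```

Note that a copy moves at least twice its nominal size across the memory bus (source reads plus destination writes, more if writes allocate cachelines), while non-temporal stores change that accounting; bandwidth claims differ partly because they count different things.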

SSD/NVMe and I/O characteristics

  • Modern NVMe sequential reads (10–14 GB/s) can exceed what a single core can process, but:
    • Peak numbers are short bursts; sustained real‑world throughput is lower, especially with random access.
    • DMA allows SSDs to move data without consuming CPU cycles, shifting the bottleneck back to memory and higher‑level processing.
  • Debate over whether multiple cores are needed to saturate SSDs; consensus is that IOPS patterns (many small writes vs larger ones) matter more than raw bandwidth.
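
The gap between headline and observed throughput is easy to probe. The rough sketch below measures sequential reads as an application actually sees them through read(), page cache and filesystem included; a fairer device benchmark would use O_DIRECT and cold caches, and the 1 MiB buffer size is an assumption:

```c
/* Rough sequential-read probe: pass any large file, ideally one not already
 * in the page cache. This measures what the application sees, not the
 * drive's headline number. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s <large-file>\n", argv[0]); return 1; }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    size_t bufsz = (size_t)1 << 20;   /* 1 MiB reads amortize syscall cost */
    char *buf = malloc(bufsz);
    long long total = 0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    ssize_t n;
    while ((n = read(fd, buf, bufsz)) > 0)
        total += n;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.2f GB/s over %lld bytes\n", total / secs / 1e9, total);
    close(fd); free(buf);
    return 0;
}
```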

Serialization, zero‑copy formats, and parsing

  • One line of argument: formats like JSON/Protobuf require full parsing before accessing fields, so they’re constrained by per‑core scan bandwidth.
  • Zero‑copy, indexed formats can “skip” large parts of messages (only touching the needed cachelines), effectively delivering higher useful throughput per core; see the sketch after this list.
  • A new format (Lite³) is discussed:
    • Schemaless, fully indexed, allows in‑place mutation; trades message size for flexibility.
    • Some see schemaless as great ergonomically; others argue that in practice you always have a schema and want size benefits from encoding it.
    • Questions around fragmentation, vacuuming, and how variable‑length fields are updated in place.
    • Comparisons and references to Cap’n Proto, Flatbuffers, rkyv, and another schema‑based format (STEF).
  • Skeptics note that “outperforming” parsers is easier when data is effectively pre‑parsed and memory‑mapped; analogy debates (tanker truck vs beer cans) explore what counts as “real” work.
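
The cacheline argument is easiest to see in code. The toy layout below is hypothetical, not Lite³, Cap’n Proto, or Flatbuffers: a small offset table lets a reader fetch one field by touching a few cachelines, where a JSON-style parser would have to scan everything preceding it:

```c
/* Toy zero-copy layout (hypothetical): a header stores the offset of each
 * field, so reading one field touches the header, one table slot, and the
 * field itself, and nothing else in the message. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Wire layout: [u32 field_count][u32 offsets...][field bytes...] */
static int64_t get_i64_field(const uint8_t *msg, uint32_t idx) {
    uint32_t count, off;
    memcpy(&count, msg, 4);              /* header read */
    if (idx >= count) return 0;
    memcpy(&off, msg + 4 + 4 * idx, 4);  /* one offset-table read */
    int64_t v;
    memcpy(&v, msg + off, 8);            /* one field read */
    return v;
}

int main(void) {
    /* Build a 2-field message: count, offset table, two i64 payloads. */
    uint8_t msg[64] = {0};
    uint32_t count = 2, off0 = 12, off1 = 20;
    int64_t a = 42, b = 7;
    memcpy(msg, &count, 4);
    memcpy(msg + 4, &off0, 4);
    memcpy(msg + 8, &off1, 4);
    memcpy(msg + off0, &a, 8);
    memcpy(msg + off1, &b, 8);

    printf("field 1 = %lld\n", (long long)get_i64_field(msg, 1));
    return 0;
}
```

This also illustrates the skeptics’ point: the work of laying out the offset table happened at write time, so the cheap read is partly pre-paid parsing.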

Future architectures & unified memory/I/O

  • Speculation that CXL/PCIe and AI‑driven investment may push architectures toward a mesh of CPUs, RAM, storage, and GPUs in one unified virtual address space.
  • Others point out that today’s systems already map devices into a common address space via PCIe and mmap, but practical concerns (filesystems, sharing between processes) keep higher‑level abstractions in place.
  • Some wishlist ideas: mmap with malloc‑like ergonomics, trivial “make this buffer persistent” APIs, SSD treated more like extended RAM.
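
On POSIX systems the “make this buffer persistent” wish already has a rough approximation: map a file and treat the pointer like allocated memory, as sketched below. The file name and size are illustrative, and msync is still required if durability across power loss (not just process exit) matters:

```c
/* Sketch of mmap-backed persistence: an ordinary pointer whose contents
 * survive the process, because the kernel writes dirty pages back to the
 * file. File name and size are illustrative. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    const char *path = "state.bin";   /* hypothetical backing file */
    size_t len = 4096;
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (ftruncate(fd, (off_t)len) < 0) { perror("ftruncate"); return 1; }

    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    /* Use it like malloc'd memory. */
    strcpy(buf, "survives a process restart");

    if (msync(buf, len, MS_SYNC) < 0) perror("msync"); /* force to storage */
    munmap(buf, len);
    close(fd);
    return 0;
}
```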

fsync, persistence, and mmap

  • Even with fast NVMe, fsync remains slow, and it is still essential for true durability, especially for databases.
  • One view: many applications could relax durability (rely on backups) and benefit from mmap‑style persistence, where process crashes don’t lose data.
  • Questions about fsync semantics (waiting on all operations vs relevant ranges) and whether NVMe controllers sometimes lie about flush completion.
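
For context, the sequence under discussion looks like the sketch below: write() alone only reaches the page cache, and it is the fsync/fdatasync step (which should propagate a flush to the device) whose cost, and whose honesty at the controller level, the thread debates. The file name is illustrative:

```c
/* The standard durable-write sequence: write() lands in the page cache;
 * fdatasync() asks the kernel, and via a FLUSH command the drive, to make
 * the data durable. fdatasync skips metadata-only updates such as mtime,
 * which is why databases often prefer it over fsync. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int fd = open("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0644); /* illustrative file */
    if (fd < 0) { perror("open"); return 1; }

    const char *rec = "commit 42\n";
    if (write(fd, rec, strlen(rec)) < 0) { perror("write"); return 1; }

    /* Without this, a power cut can lose the record even though
     * write() already returned success. */
    if (fdatasync(fd) < 0) { perror("fdatasync"); return 1; }

    close(fd);
    return 0;
}
```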

Real‑world performance anecdotes

  • Someone optimizing an OLAP database reports that under high concurrency the bottleneck is memory bandwidth, not disk.
  • Others note that in cloud VMs/containers, storage I/O is still a very real bottleneck; managed/cloud setups often deliver far less performance per dollar than local hardware.

Software latency and bloat vs hardware gains

  • Several commenters observe that despite enormous hardware advances, many everyday apps (messengers, Windows UI, etc.) feel slower.
  • Proposed causes:
    • Blocking I/O on the GUI thread and insufficient attention to latency (see the sketch after this list).
    • Software bloat (heavy frameworks, Electron apps) keeping CPUs busy doing non‑essential work.
  • Historical examples (e.g., C64 word processors) show carefully staged UI work for responsiveness; people lament that such disciplined engineering is rarer now.
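
The GUI-thread point reduces to a simple structure, sketched below with POSIX threads standing in for a real toolkit’s event loop: the slow call runs on a worker thread while the main thread keeps producing frames. All names and timings are illustrative:

```c
/* Sketch of "don't block the GUI thread": slow I/O runs on a worker thread
 * while the main loop (a stand-in for a toolkit's event loop) keeps going.
 * Build with -lpthread. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *slow_io(void *arg) {
    (void)arg;
    sleep(2);                           /* stand-in for a slow disk/network call */
    printf("io: done, post result back to the UI queue\n");
    return NULL;
}

int main(void) {
    pthread_t worker;
    pthread_create(&worker, NULL, slow_io, NULL);  /* never on the UI thread */

    for (int frame = 0; frame < 4; frame++) {      /* UI stays responsive meanwhile */
        printf("ui: frame %d rendered\n", frame);
        usleep(500 * 1000);
    }

    pthread_join(worker, NULL);
    return 0;
}
```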

General takeaway

  • No single universal bottleneck: modern systems are a balance of CPU, memory hierarchy, storage, and concurrency.
  • The consensus advice: measure actual workloads (profiles, traces), identify what saturates first, change one thing, and measure again rather than relying on slogans like “I/O is no longer the bottleneck.”