io_uring is faster than mmap

Why io_uring Outperformed mmap in the Article’s Setup

  • The io_uring test used O_DIRECT and a kernel worker pool: multiple threads issue I/O and fill a small set of user buffers, while the main thread just scans them.
  • The mmap test was effectively single-threaded and page-fault driven: every 4 KiB page not yet mapped triggers a fault and VM bookkeeping, even when the data is already in the page cache.
  • Commenters argue the big win is reduced page-fault/VM overhead and parallelism, not “disk being faster than RAM.”
  • Several note that a fairer comparison would use multi-threaded mmap with prefetch, which can get close to io_uring performance, especially when data is cached.

Tuning mmap: Huge Pages, Prefetching, and Threads

  • 4 KiB pages over tens to hundreds of GB create huge page tables and TLB pressure; this can dominate CPU time.
  • Some report large speedups on huge files with MAP_HUGETLB/MAP_HUGE_1GB; others note filesystem and alignment constraints and mixed results.
  • MAP_POPULATE was tested: it improved inner-loop bandwidth but increased total run time (populate cost dominated).
  • Suggestions: MADV_SEQUENTIAL, background prefetch threads that touch each page, and multi-threaded mmap access; an experiment with 6 prefetching threads reached ~5.8 GB/s, similar to io_uring on two drives but still below a pure in-RAM copy.

PCIe vs Memory and DDIO

  • Debate over “PCIe bandwidth is higher than memory bandwidth”:
    • On some server parts, total PCIe bandwidth (all lanes) can rival or exceed DRAM channel bandwidth; on desktops it’s often similar or lower.
    • Everyone agrees DRAM and caches still have lower latency and higher per-channel bandwidth; disk traffic ultimately ends up in RAM.
  • Intel DDIO and similar features can DMA directly into L3 cache, briefly bypassing DRAM; this is mentioned as a theoretical path where device→CPU data can look “faster than memory,” but not exercised in the article.

Methodology, Title, and Alternatives

  • Multiple commenters call the original “Memory is slow, Disk is fast” framing clickbait, preferring titles like “io_uring is faster than mmap (in this setup).”
  • Criticism centers on comparing an optimized async pipeline to a naive, single-threaded mmap loop without readahead hints.
  • Others find the exploration valuable and see mmap as a convenience API, not a high-performance primitive.
  • Suggested further work: SPDK comparisons, AVX512 + manual prefetch, NUMA-aware allocation, splice/vmsplice, and better visualizations (log-scaled, normalized plots).

io_uring Security & Deployment

  • io_uring is viewed as powerful but complex and still evolving; some platforms disable it, citing attack-surface concerns and incomplete LSM/SELinux integration.
  • Guidance in the thread: fine for non-hostile workloads with regular kernel updates; use stronger isolation for untrusted code.