io_uring is faster than mmap
Why io_uring Outperformed mmap in the Article’s Setup
- io_uring test used O_DIRECT and a kernel worker pool: multiple threads issue I/O and fill a small set of user buffers; the main thread just scans them.
- mmap test was effectively single-threaded and page-fault driven: every 4K page not yet mapped triggers a fault and VM bookkeeping, even when data is already in the page cache.
- Commenters argue the big win is reduced page-fault/VM overhead and parallelism, not “disk being faster than RAM.”
- Several note that a fairer comparison would use multi-threaded mmap with prefetch, which can get close to io_uring performance, especially when data is cached.
Tuning mmap: Huge Pages, Prefetching, and Threads
- 4K pages over tens–hundreds of GB create huge page tables and TLB pressure; this can dominate CPU time.
- Some report large speedups on huge files with MAP_HUGETLB/MAP_HUGE_1GB; others note filesystem and alignment constraints and mixed results.
- MAP_POPULATE was tested: it improved inner-loop bandwidth but increased total run time (the populate cost dominated).
- Suggestions: MADV_SEQUENTIAL, background prefetch threads that touch each page, and multi-threaded mmap access; an experiment with 6 prefetching threads reached ~5.8 GB/s, similar to io_uring on two drives but still below a pure in-RAM copy.
PCIe vs Memory and DDIO
- Debate over “PCIe bandwidth is higher than memory bandwidth”:
  - On some server parts, total PCIe bandwidth (all lanes) can rival or exceed DRAM channel bandwidth; on desktops it’s often similar or lower.
  - Everyone agrees DRAM and caches still have lower latency and higher per-channel bandwidth; disk traffic ultimately ends up in RAM.
- Intel DDIO and similar features let device DMA land directly in L3 cache, skipping a DRAM round-trip; this is mentioned as a theoretical path where device→CPU data can look “faster than memory,” but it is not exercised in the article.
Methodology, Title, and Alternatives
- Multiple commenters call the original “Memory is slow, Disk is fast” framing clickbait, preferring titles like “io_uring is faster than mmap (in this setup).”
- Criticism centers on comparing an optimized async pipeline to a naive, single-threaded mmap loop without readahead hints.
- Others find the exploration valuable and see mmap as a convenience API, not a high-performance primitive.
- Suggested further work: SPDK comparisons, AVX512 + manual prefetch, NUMA-aware allocation, splice/vmsplice, and better visualizations (log-scaled, normalized plots).
io_uring Security & Deployment
- io_uring is viewed as powerful but complex and still evolving; some platforms disable it, citing gaps in LSM/SELinux integration and attack-surface concerns.
- Guidance in the thread: fine for non-hostile workloads with regular kernel updates; use stronger isolation for untrusted code.