io_uring is faster than mmap

Why io_uring Outperformed mmap in the Article’s Setup

  • The io_uring test used O_DIRECT and a kernel worker pool: multiple threads issue I/O and fill a small set of user buffers, while the main thread just scans them.
  • The mmap test was effectively single-threaded and page-fault driven: every 4 KiB page not yet mapped triggers a fault and VM bookkeeping, even when the data is already in the page cache.
  • Commenters argue the big win is reduced page-fault/VM overhead and parallelism, not “disk being faster than RAM.”
  • Several note that a fairer comparison would use multi-threaded mmap with prefetch, which can get close to io_uring performance, especially when data is cached.

Tuning mmap: Huge Pages, Prefetching, and Threads

  • 4 KiB pages over tens to hundreds of GB create huge page tables and TLB pressure; this can dominate CPU time.
  • Some report large speedups on huge files with MAP_HUGETLB/MAP_HUGE_1GB; others note filesystem and alignment constraints and mixed results.
  • MAP_POPULATE was tested: it improved inner-loop bandwidth but increased total run time (populate cost dominated).
  • Suggestions: MADV_SEQUENTIAL, background prefetch threads that touch each page, and multi-threaded mmap access; an experiment with 6 prefetching threads reached ~5.8 GB/s, similar to io_uring on two drives but still below a pure in-RAM copy.

PCIe vs Memory and DDIO

  • Debate over “PCIe bandwidth is higher than memory bandwidth”:
    • On some server parts, total PCIe bandwidth (all lanes) can rival or exceed DRAM channel bandwidth; on desktops it’s often similar or lower.
    • Everyone agrees DRAM and caches still have lower latency and higher per-channel bandwidth; disk traffic ultimately ends up in RAM.
  • Intel DDIO and similar features can DMA directly into L3 cache, briefly bypassing DRAM; this is mentioned as a theoretical path where device→CPU data can look “faster than memory,” but not exercised in the article.

Methodology, Title, and Alternatives

  • Multiple commenters call the original “Memory is slow, Disk is fast” framing clickbait, preferring titles like “io_uring is faster than mmap (in this setup).”
  • Criticism centers on comparing an optimized async pipeline to a naive, single-threaded mmap loop without readahead hints.
  • Others find the exploration valuable and see mmap as a convenience API, not a high-performance primitive.
  • Suggested further work: SPDK comparisons, AVX512 + manual prefetch, NUMA-aware allocation, splice/vmsplice, and better visualizations (log-scaled, normalized plots).

io_uring Security & Deployment

  • io_uring is viewed as powerful but complex and still evolving; some platforms disable it, citing attack-surface concerns and incomplete LSM/SELinux integration.
  • Guidance in the thread: fine for non-hostile workloads with regular kernel updates; use stronger isolation for untrusted code.