The Fastest Mutexes
Library and implementation comparisons
- The article’s fast mutexes build on the nsync library; commenters note it’s by a well-known engineer and compare it to other advanced mutex implementations (e.g., Abseil, SRWLOCK, Rust’s evolving std::Mutex).
- Some wonder why certain high-quality mutexes (e.g., Abseil’s) weren’t benchmarked. Others point out that “std::mutex” often just wraps pthreads, so benchmarking pthreads is effectively equivalent.
- Rust’s mutex implementation has undergone multiple revisions (2012–2024), focusing on moveability, const construction, and platform-specific optimizations; Linux code is still largely derived from a prior well-regarded implementation.
Fairness vs throughput
- Cosmopolitan’s mutex is explicitly unfair but tries to reduce starvation with a queue and priority scheme; it can’t strictly guarantee no starvation.
- Several comments argue most high-performance locks today are unfair because fairness creates convoying and throughput loss.
- Others insist fairness and starvation properties should be explicit dimensions in evaluations, as unfair locks can severely underutilize cores for certain workloads.
Mutex misuse, message passing, and concurrency education
- Many describe negative experiences with misused mutexes and “voodoo” concurrency practices (e.g., sprinkling random volatile qualifiers or locks).
- Message passing / queues are favored by some as easier to reason about and debug, though they still rely on internal synchronization.
- Books and online resources about Java/C++/Rust memory models and atomics are recommended; a recurring theme is that “correctly synchronized, data-race‑free” code can often be reasoned about as if sequentially consistent.
Spinlocks, atomics, and low-level details
- Spinlocks can outperform mutexes in uncontended or extremely short critical sections because they avoid CAS on unlock and syscalls, but they risk wasting CPU and interacting badly with schedulers and QoS.
- Several nuanced discussions cover: CAS vs simple stores, acquire/release vs relaxed orderings, futex usage, backoff strategies, and platform quirks (x86 pause, Darwin QoS, Linux sched_yield behavior).
- Multiple commenters warn that using volatile for multithreading in C/C++ is incorrect; real atomics and fences should be used.
Benchmarking methodology
- Multiple participants criticize the article’s microbenchmark: it measures heavy contention on a single mutex with trivial work, which may reward pathological behaviors and not reflect real workloads.
- Suggested better benchmarks: large, real multithreaded apps with varied contention levels, critical-section lengths, and lock topologies; include uncontended and failed try_lock costs.
- Some note that modern lock implementations already mix fast optimistic CAS, bounded spinning, and sleep/wake mechanisms; performance is highly workload- and architecture-dependent.
Cosmopolitan, APE, and adoption concerns
- Many find Cosmopolitan/APE technically impressive (fat, cross‑platform binaries, fast malloc, tuned primitives), but see them as clever hacks rather than obvious production defaults.
- Concerns include: reliance on subtle OS behaviors, potential future incompatibilities, “rough around the edges” status, and the difficulty of convincing conservative production teams.
- The author’s hyperbolic claims (e.g., implying professional irresponsibility in not adopting Cosmo) are seen by some as humor, by others as off‑putting or manipulative.
Why libcs haven’t all switched
- Explanations offered: different priorities (stability over peak speed), limited maintainer time, ABI compatibility constraints, conservative attitudes, and the fact that “good enough” mutexes already exist.
- Some assert that many standard libraries leave performance on the table (allocators, string routines, hash maps), showing that “if it’s so good, it would already be adopted” is not a reliable argument.
Production priorities
- Several commenters stress that in production, reliability, predictability, and debuggability trump raw speed.
- Slow, instrumented locks that light up profilers can be preferable in development to force refactoring away from bad contention patterns.