Make your program slower with threads (2014)

Threads and Performance Pitfalls

  • Many commenters echoed the “rite of passage” of assuming more threads ⇒ more speed, then seeing slowdowns instead.
  • Common reasons: lock contention, hidden shared state in libraries (e.g., random number generators, allocators, stdio), and oversubscription (more threads than cores).
  • Anecdotes: early multithreading on single‑CPU systems (e.g., OS/2) led to crashes and mysterious bugs until code was simplified back to a single loop.
  • Another case: futex and user‑space spinlock congestion (e.g., in a custom allocator) caused an orders‑of‑magnitude slowdown when many threads pushed to vectors.

Libc, Global State, and RNG Contention

  • A central theme: libc functions often rely on global or shared state and were not originally designed for heavy multithreading.
  • random()/rand() use global state and locks, causing severe contention in multithreaded code. Thread‑safe variants (random_r, per‑thread RNGs) are recommended.
  • Some argue this is a design flaw; others counter it was reasonable in the original single‑threaded context but is problematic today.
  • Similar global-state issues exist with stdio, allocators, and errno (though some implementations make errno thread‑local).

PRNG Design and Parallelism

  • Consensus: parallel code should avoid multiple threads sharing a single PRNG state.
  • Suggested patterns: per‑thread PRNGs with distinct, uncorrelated seeds, sometimes derived from a CSPRNG or OS entropy; or counter‑based RNGs for deterministic splitting.
  • Debate over whether “PRNGs need global state”: one side says any single PRNG must track shared position in its sequence; others emphasize that sharding state (per thread / per object) is usually better.
  • C’s requirement that omitting srand be equivalent to srand(1) is seen as blocking some optimizations that other languages (e.g., Go) use with thread‑local RNGs.

Alternatives, Languages, and Libraries

  • Examples praised: Go’s dual RNG behavior (global vs thread‑local), Zig’s thread‑local RNG, Java’s ThreadLocalRandom, Rust’s emphasis on data‑race freedom.
  • Some suggest using alternative libcs (e.g., musl) or custom libraries rather than default libc.

Broader Reflections: Optimization, Types, and Abstractions

  • The classic “premature optimization” advice is debated: some say modern code is so inefficient everywhere that hotspots barely exist; others insist profiling should still drive optimization.
  • Several comments lament “leaky abstractions”: a call like “give me a random number” hides global state, locks, and determinism trade‑offs.
  • There is interest in richer type systems and effects/capability systems (monads, algebraic effects, purity) to track side effects (IO, allocation, global state) and make such issues more visible.