Do not taunt happy fun branch predictor (2023)

Branch prediction and call/return behavior

  • Several commenters note that the core issue isn’t “fancy” prediction but a violated basic assumption: mismatched calls and returns (BL/RET) desynchronize the predictor’s return-address stack.
  • This can cause large slowdowns and, on systems with architectural shadow call stacks, outright crashes.
  • Some point to earlier writeups of the same trap and to a simple fix: keep calls and returns balanced, and if inlining or restructuring leaves a return without a matching call, replace RET with an ordinary indirect branch (e.g., BR LR) so the predictor’s call/return pairing stays consistent.
  • Others note that some modern x86 CPUs special-case common idioms such as CALL +0 (a call to the next instruction, used to read the program counter) so that they don’t pollute the return stack; older “pessimizations” may therefore no longer apply.
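
The pairing argument above can be sketched with a toy model of a return-address stack. This is a simplified illustration under stated assumptions, not a description of any real core: the class name, the depth of 16, and the addresses (0x40, 0x100, 0x200) are all hypothetical, and real hardware predicts plain indirect branches with a separate mechanism that this model ignores.

```python
class ReturnStackPredictor:
    """Toy return-address stack: calls push, returns pop and compare."""

    def __init__(self, depth=16):  # depth is an assumed value
        self.depth = depth
        self.stack = []
        self.mispredicts = 0

    def call(self, return_address):
        # A call-like instruction (BL) records where the matching return should go.
        if len(self.stack) == self.depth:
            self.stack.pop(0)  # oldest entry is lost when the stack is full
        self.stack.append(return_address)

    def ret(self, actual_target):
        # A return-like instruction (RET) is predicted from the top of the stack.
        predicted = self.stack.pop() if self.stack else None
        if predicted != actual_target:
            self.mispredicts += 1


def run_balanced():
    p = ReturnStackPredictor()
    p.call(0x100)   # outer -> f
    p.call(0x200)   # f -> g
    p.ret(0x200)    # g returns into f
    p.ret(0x100)    # f returns into outer
    return p.mispredicts


def run_ret_as_jump(iterations):
    p = ReturnStackPredictor()
    p.call(0x100)           # the only real call
    for _ in range(iterations):
        p.ret(0x40)         # RET used as a loop branch, with no matching call
    p.ret(0x100)            # the real return now finds an empty stack
    return p.mispredicts


def run_plain_branch(iterations):
    p = ReturnStackPredictor()
    p.call(0x100)
    for _ in range(iterations):
        pass                # an ordinary branch does not touch the return stack
    p.ret(0x100)
    return p.mispredicts


print(run_balanced())        # 0
print(run_ret_as_jump(4))    # 5
print(run_plain_branch(4))   # 0
```

In this model every unpaired RET mispredicts, and it also consumes the entry that the one legitimate return needed, which is why switching the loop branch to a non-return instruction restores the pairing.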

Hardware prediction vs explicit hints

  • A tangent asks why CPUs guess branch/load behavior instead of letting programmers specify it.
  • Replies argue that:
    • Extra hint instructions cost decode bandwidth and code size.
    • Most existing code predates sophisticated prediction, and hardware must accelerate unannotated binaries.
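    • Prediction capabilities differ across CPU generations, so a hint tuned for one core can be wrong for the next; Itanium/VLIW is cited as evidence that relying on the compiler for scheduling and prediction in general-purpose code failed commercially.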
  • High-level hints do exist (GCC/Clang’s __builtin_expect, the C++20 [[likely]]/[[unlikely]] attributes, the Linux kernel’s likely()/unlikely() macros), but:
    • They’re incomplete or wrong in many places.
    • Modern predictors often outperform explicit hints; extra hint instructions can even slow code.
  • One detailed comment explains how an M1-like core must predict returns before decoding the RET, using a dedicated return-address stack.

Floating‑point summation, SIMD, and determinism

  • Debate centers on why compilers don’t auto-vectorize float summations:
    • One side: summation is inherently approximate; any result within known error bounds is valid, so compilers should freely reorder and SIMD-ize for speed.
    • Other side: bit-for-bit reproducibility across builds, compilers, and CPUs is critical in many domains; implicit reordering breaks that and complicates debugging.
  • Examples show that reordering additions of very different magnitudes can flip results dramatically.
  • Libraries and toolchains vary: some use SIMD by default only for integer sums or provide flags to trade speed vs reproducibility; fast-math flags are described as powerful but numerically risky.
  • More advanced summation algorithms (pairwise, Kahan, superaccumulator/xsum) are mentioned as ways to improve accuracy or even make results exact, at some performance cost.
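
The reordering and compensation points above can be illustrated with a small sketch. It rests on several assumptions: it uses Python's 64-bit floats rather than the 32-bit floats of the article, the input values are chosen for illustration, and `lane_sum` only imitates the accumulation order of a SIMD loop without using SIMD. Explicit loops are used instead of the built-in `sum`, because newer CPython versions apply compensated summation there.

```python
import math


def naive_sum(values):
    """Left-to-right accumulation, the order a scalar loop would use."""
    total = 0.0
    for v in values:
        total += v
    return total


def lane_sum(values, lanes=4):
    """Keep `lanes` independent partial sums, then combine them at the end."""
    partial = [0.0] * lanes
    for i, v in enumerate(values):
        partial[i % lanes] += v
    return naive_sum(partial)


def kahan_sum(values):
    """Kahan compensated summation: carry the rounding error of each add."""
    total = 0.0
    c = 0.0  # running compensation for lost low-order bits
    for v in values:
        y = v - c
        t = total + y
        c = (t - total) - y
        total = t
    return total


def pairwise_sum(values, block=8):
    """Recursively sum halves; short runs fall back to the naive loop."""
    if len(values) <= block:
        return naive_sum(values)
    mid = len(values) // 2
    return pairwise_sum(values[:mid], block) + pairwise_sum(values[mid:], block)


# Same four values, two orders: the small terms vanish next to 1e16.
print(naive_sum([1e16, 1.0, 1.0, -1e16]))          # 0.0
print(naive_sum([1e16, -1e16, 1.0, 1.0]))          # 2.0

# Same order, different accumulation pattern.
print(naive_sum([1e16, 1.0, -1e16, 1.0]))          # 1.0
print(lane_sum([1e16, 1.0, -1e16, 1.0], lanes=2))  # 2.0

# Compensated and exactly rounded sums recover the lost terms here.
print(kahan_sum([1e16, 1.0, 1.0, -1e16]))          # 2.0
print(math.fsum([1e16, 1.0, 1.0, -1e16]))          # 2.0
```

Pairwise summation mainly limits how error grows on long inputs and would not rescue a four-element cancellation like the one above; Kahan summation can also lose its correction for some orderings. `math.fsum` returns the correctly rounded sum and stands in for the "exact" end of the trade-off in this sketch.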

Performance and software bloat

  • Commenters marvel at the nanosecond-scale loop times relative to 1 MHz-era CPUs, but contrast this with how “slow” modern desktop software feels.
  • There’s a recurring theme that developer productivity stacks (Electron, heavy frameworks) trade away large amounts of performance; some see this as rational (time-to-market), others as user-hostile.
  • Progress indicators are seen as masking, not fixing, unnecessary latency in modern UIs.

Language and assembly style notes

  • Some dislike dense C idioms with side effects (e.g., *p++), praising languages that forbid them for clarity.
  • There’s discussion of addressing modes: AArch64’s post-indexing (ldr s1, [x0], #4) vs x86 string ops (lods/stos/movs), and how they express “load and increment” patterns.
  • Minor nits include unit switching (µs vs ns) and a mid-article switch from C to Rust, which some found visually confusing but not conceptually important.