2024-07-03

Do not taunt happy fun branch predictor (2023)

Branch prediction and call/return behavior

Several commenters note that the core issue is not “fancy” prediction but a violated basic assumption: mismatched calls and returns (a BL with no matching RET, or a RET with no matching BL) desynchronize the internal stack that the return-address predictor uses.
This can cause large slowdowns and, on systems with architectural shadow call stacks, outright crashes.
Some point out prior writeups of the same trap and highlight a simple fix: keep calls and returns balanced and, if you inline or restructure so that a RET no longer has a matching call, replace it with a plain indirect branch (e.g., BR LR, where LR is X30) so the predictor’s call/return pairing stays consistent.
Others note that some modern x86 CPUs special-case common idioms like CALL +0 (a call to the next instruction, used to read the instruction pointer) to avoid polluting the return stack, so older “pessimizations” may no longer apply.
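The pairing rule above can be illustrated with a toy model of a return-address stack in C. This is a sketch, not a description of any real core: the names (ras_t, ras_call, ras_ret) and the 16-entry depth are assumptions made for illustration. A call pushes its return address; a return pops a prediction and compares it with the address actually returned to.

```c
#include <stddef.h>
#include <stdint.h>

#define RAS_DEPTH 16 /* assumed depth; real cores differ */

/* Toy return-address stack: a small circular buffer of predicted targets. */
typedef struct {
    uint64_t entry[RAS_DEPTH];
    size_t count; /* number of valid entries, 0..RAS_DEPTH */
    size_t top;   /* index of the next free slot */
} ras_t;

/* Model of a call (BL): remember where the matching return should go. */
void ras_call(ras_t *r, uint64_t return_addr) {
    r->entry[r->top] = return_addr;
    r->top = (r->top + 1) % RAS_DEPTH;
    if (r->count < RAS_DEPTH) {
        r->count++;
    }
}

/* Model of a return (RET): pop a prediction and check it.
   Returns 1 if the prediction matches the actual target, 0 otherwise. */
int ras_ret(ras_t *r, uint64_t actual_target) {
    if (r->count == 0) {
        return 0; /* nothing to predict from: counted as a miss */
    }
    r->top = (r->top + RAS_DEPTH - 1) % RAS_DEPTH;
    r->count--;
    return r->entry[r->top] == actual_target;
}
```

In this model, a BL followed by its matching RET predicts correctly. If code sets LR by hand, jumps with a plain branch, and then executes RET, the pop yields the outer caller’s address: that RET mispredicts, and so does the outer RET, which now finds the stack empty. Treating BR LR as an instruction that touches nothing in the model leaves the outer pairing intact.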

Hardware prediction vs explicit hints

A tangent asks why CPUs guess branch/load behavior instead of letting programmers specify it.
Replies argue that:
- Extra hint instructions cost decode bandwidth and code size.
- Most existing code predates sophisticated prediction, and hardware must accelerate unannotated binaries.
- Predictor capabilities differ across CPU generations, so a hint tuned for one may not suit another; Itanium and other VLIW designs, which relied on the compiler for scheduling and prediction in general-purpose code, failed commercially.
High-level hints do exist (likely/unlikely annotations in GCC and Clang, compiler builtins such as __builtin_expect, and kernel macros built on them), but:
- They’re incomplete or wrong in many places.
- Modern predictors often outperform explicit hints; extra hint instructions can even slow code.
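For reference, a source-level hint of the kind discussed above might look like the following C sketch. __builtin_expect is a real GCC/Clang builtin; the likely/unlikely macro names follow the Linux kernel convention, and count_negative is a hypothetical function made up for this example. The hint typically influences how the compiler lays out the code rather than telling the hardware predictor anything at run time.

```c
#include <stddef.h>

/* Kernel-style wrappers around the GCC/Clang builtin.
   The !! normalizes any nonzero value to 1. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical example: count negative entries, telling the compiler
   that a negative entry is expected to be rare. */
size_t count_negative(const int *v, size_t n) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (unlikely(v[i] < 0)) {
            count++;
        }
    }
    return count;
}
```

The function returns the same count with or without the hint; only the generated code may differ. If the hint is wrong for the real data, it can hurt instead of help, which is one of the objections raised in the thread.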
One detailed comment explains how an M1-like core must predict returns before decoding the RET, using a dedicated return-address stack.

Floating‑point summation, SIMD, and determinism

The debate centers on why compilers don’t auto-vectorize floating-point summations by default:
- One side: summation is inherently approximate; any result within known error bounds is valid, so compilers should freely reorder and SIMD-ize for speed.
- Other side: bit-for-bit reproducibility across builds, compilers, and CPUs is critical in many domains; implicit reordering breaks that and complicates debugging.
Examples show that reordering additions of very different magnitudes can change the result dramatically, because floating-point addition is not associative.
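A small C sketch of the reordering effect, under the assumption of IEEE 754 single precision and no fast-math flags. The function names are made up for illustration: sum_sequential adds strictly left to right, while sum_4lanes is a scalar model of what a 4-lane SIMD sum does, keeping four partial sums and combining them at the end.

```c
#include <stddef.h>

/* Strict left-to-right sum: the order the source code spells out. */
float sum_sequential(const float *v, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) {
        s += v[i];
    }
    return s;
}

/* Scalar model of a 4-lane SIMD sum: within each full block of four,
   element k goes to partial sum k; the partial sums are combined at
   the end, and any leftover elements are added last. */
float sum_4lanes(const float *v, size_t n) {
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        for (size_t k = 0; k < 4; k++) {
            lane[k] += v[i + k];
        }
    }
    float s = (lane[0] + lane[1]) + (lane[2] + lane[3]);
    for (; i < n; i++) {
        s += v[i];
    }
    return s;
}
```

For the input {1e20f, 1, 1, 1, -1e20f, 1, 1, 1}, the sequential sum returns 3 because the first three ones are absorbed by 1e20f, while the 4-lane sum returns 6, which is the mathematically exact answer. Neither order is wrong by IEEE rules; they are simply different computations, which is the root of the reproducibility concern.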
Libraries and toolchains vary: some use SIMD by default only for integer sums or provide flags to trade speed vs reproducibility; fast-math flags are described as powerful but numerically risky.
More advanced summation algorithms (pairwise, Kahan, superaccumulator/xsum) are mentioned as ways to improve accuracy or make results exact, at some performance cost.
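As one illustration of the algorithms named above, here is a minimal C sketch of Kahan (compensated) summation. The function name sum_kahan is an assumption for this example, and the variable names follow common textbook presentations. It assumes IEEE 754 single precision and must be built without fast-math style flags, since a compiler allowed to reassociate may simplify the compensation term away.

```c
#include <stddef.h>

/* Kahan summation: carry a correction term c that captures the
   low-order bits lost when a small value is added to a large sum. */
float sum_kahan(const float *v, size_t n) {
    float sum = 0.0f;
    float c = 0.0f; /* running compensation */
    for (size_t i = 0; i < n; i++) {
        float y = v[i] - c; /* apply the correction from the last step */
        float t = sum + y;  /* low-order bits of y may be lost here */
        c = (t - sum) - y;  /* recover what was lost (with sign flipped) */
        sum = t;
    }
    return sum;
}
```

For the input {1, 2^-25, 2^-25, 2^-25, 2^-25}, a plain left-to-right float sum stays at 1 because each small term is below half a unit in the last place, whereas sum_kahan returns 1 + 2^-23, the exact result. The cost is several extra floating-point operations per element and a strictly serial dependency chain.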

Performance and software bloat

Commenters marvel at the nanosecond-scale loop times relative to 1 MHz-era CPUs, but contrast this with how “slow” modern desktop software feels.
There’s a recurring theme that developer productivity stacks (Electron, heavy frameworks) trade away large amounts of performance; some see this as rational (time-to-market), others as user-hostile.
Progress indicators are seen as masking, not fixing, unnecessary latency in modern UIs.

Language and assembly style notes

Some dislike dense C idioms with side effects (e.g., *p++), praising languages that forbid them for clarity.
There’s discussion of addressing modes: AArch64’s post-indexing (ldr s1, [x0], #4) vs x86 string ops (lods/stos/movs), and how they express “load and increment” patterns.
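To make the “load and increment” pattern concrete, here is a small C sketch with two hypothetical functions. sum_dense uses the *p++ idiom; sum_explicit splits the same work into a load and a separate pointer increment. The post-indexed form ldr s1, [x0], #4 does the same two things in one instruction: it loads from the address in x0, then adds 4 to x0. Whether a given compiler emits that instruction for either function is not guaranteed.

```c
#include <stddef.h>

/* Dense form: read through p, then advance p, in one expression. */
float sum_dense(const float *p, size_t n) {
    float s = 0.0f;
    while (n--) {
        s += *p++;
    }
    return s;
}

/* Explicit form: same behavior, with the load and the increment split. */
float sum_explicit(const float *p, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) {
        s += *p; /* load the current element */
        p += 1;  /* then advance by one element */
    }
    return s;
}
```

Both functions add the elements in the same order, so they return identical results; the difference is purely one of readability, which is what the commenters disagree about.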
Minor nits include unit switching (µs vs ns) and a mid-article switch from C to Rust, which some found visually confusing but not conceptually important.
