Do not taunt happy fun branch predictor (2023)
Branch prediction and call/return behavior
- Several commenters note the core issue isn’t “fancy” prediction but violating basic assumptions: using mismatched call/return (BL/RET) breaks the return-address predictor’s shadow stack.
- This can cause large slowdowns and, on systems with architectural shadow call stacks, outright crashes.
- Some point out prior writeups of the same trap and highlight a simple fix: keep calls/returns balanced and, if you inline or restructure, replace RET with an ordinary branch (e.g., BR LR) so the predictor’s call/return pairing stays consistent.
- Others note that some modern x86 CPUs special-case common idioms like CALL + 0 to avoid polluting the return stack, so older “pessimizations” may no longer apply.
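The call/return pairing described above can be illustrated with a toy model. This is a minimal sketch, not a description of any real core: the function name `count_return_mispredictions`, the event names, and the addresses are all made up for illustration, and the model ignores the separate indirect-branch predictor that real hardware would use for something like BR LR.

```python
def count_return_mispredictions(trace):
    """Count RET mispredictions for a trace of (kind, address) events.

    kind is "call" (address = return address that gets pushed),
    "ret" (address = where the return actually goes), or
    "branch" (ignored by the return-address stack).
    """
    return_stack = []  # toy return-address stack, unbounded for simplicity
    mispredictions = 0
    for kind, address in trace:
        if kind == "call":
            return_stack.append(address)
        elif kind == "ret":
            # The predictor guesses the most recently pushed return address.
            predicted = return_stack.pop() if return_stack else None
            if predicted != address:
                mispredictions += 1
        # "branch" events neither push nor pop.
    return mispredictions


ITERATIONS = 1000
RETURN_SITE = 0x104  # made-up address of the instruction after the call
ROUTINE = 0x200      # made-up address of the called routine

# Every call is matched by a return: the stack always predicts correctly.
balanced = [("call", RETURN_SITE), ("ret", RETURN_SITE)] * ITERATIONS
# Entered with a plain branch but left with RET: nothing was pushed to match.
mismatched = [("branch", ROUTINE), ("ret", RETURN_SITE)] * ITERATIONS
# The fix from the thread: leave with an ordinary branch instead of RET.
rebalanced = [("branch", ROUTINE), ("branch", RETURN_SITE)] * ITERATIONS

print(count_return_mispredictions(balanced))    # 0
print(count_return_mispredictions(mismatched))  # 1000
print(count_return_mispredictions(rebalanced))  # 0
```

In this model the mismatched trace mispredicts on every iteration, which is the shape of the slowdown commenters describe, though actual penalties depend on the microarchitecture.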
Hardware prediction vs explicit hints
- A tangent asks why CPUs guess branch/load behavior instead of letting programmers specify it.
- Replies argue that:
  - Extra hint instructions cost decode bandwidth and code size.
  - Most existing code predates sophisticated prediction, and hardware must accelerate unannotated binaries.
  - Capabilities differ across CPU generations, and Itanium/VLIW showed that relying on the compiler for scheduling and prediction in general-purpose code failed commercially.
- High-level hints do exist (GCC/Clang likely/unlikely, C builtins, kernel macros), but:
  - They’re incomplete or wrong in many places.
  - Modern predictors often outperform explicit hints; extra hint instructions can even slow code.
- One detailed comment explains how an M1-like core must predict returns before decoding the RET, using a dedicated return-address stack.
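The argument that dynamic prediction can beat a fixed hint can be sketched with a classic textbook predictor, a 2-bit saturating counter. This is an illustrative toy, far simpler than any modern predictor; the function names and the phased branch pattern are assumptions chosen to make the point visible.

```python
def static_hint_mispredictions(outcomes, hint_taken):
    """Mispredictions when every execution is predicted from a fixed hint."""
    return sum(1 for taken in outcomes if taken != hint_taken)


def two_bit_counter_mispredictions(outcomes, state=1):
    """Mispredictions of a 2-bit saturating counter.

    States 0..3; the branch is predicted taken when state >= 2.
    The counter moves one step toward the observed outcome each time.
    """
    misses = 0
    for taken in outcomes:
        if (state >= 2) != taken:
            misses += 1
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return misses


# Hypothetical branch whose behavior changes halfway through the run,
# so any single static hint is wrong for half of the executions.
phased = [True] * 500 + [False] * 500

print(static_hint_mispredictions(phased, hint_taken=True))  # 500
print(two_bit_counter_mispredictions(phased))               # 3
```

The counter pays a few mispredictions each time the pattern changes, while the static hint stays wrong for the whole second phase. When the hint happens to be right for the entire run, the static version wins by a small margin instead.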
Floating‑point summation, SIMD, and determinism
- Debate centers on why compilers don’t auto-vectorize float summations:
- One side: summation is inherently approximate; any result within known error bounds is valid, so compilers should freely reorder and SIMD-ize for speed.
- Other side: bit-for-bit reproducibility across builds, compilers, and CPUs is critical in many domains; implicit reordering breaks that and complicates debugging.
- Examples show that reordering additions of very different magnitudes can flip results dramatically.
- Libraries and toolchains vary: some use SIMD by default only for integer sums or provide flags to trade speed vs reproducibility; fast-math flags are described as powerful but numerically risky.
- More advanced summation algorithms (pairwise, Kahan, superaccumulator/xsum) are mentioned as ways to improve or make results exact, at some performance cost.
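The points in this debate can be reproduced in a few lines. The sketch below is illustrative only: the helper names are made up, `lane_sum` merely imitates the accumulation order of a SIMD loop in plain Python, and `math.fsum` stands in for the exact-summation approaches mentioned (it is not xsum itself).

```python
import math


def naive_sum(xs):
    """Left-to-right summation, the order a scalar loop would use."""
    total = 0.0
    for x in xs:
        total += x
    return total


def lane_sum(xs, lanes):
    """Sum with `lanes` independent accumulators, as a vectorized loop would."""
    acc = [0.0] * lanes
    for i, x in enumerate(xs):
        acc[i % lanes] += x
    return naive_sum(acc)


def pairwise_sum(xs):
    """Recursively split the input in half and add the two partial sums."""
    if len(xs) <= 2:
        return naive_sum(xs)
    mid = len(xs) // 2
    return pairwise_sum(xs[:mid]) + pairwise_sum(xs[mid:])


def kahan_sum(xs):
    """Compensated summation: carry the rounding error of each addition."""
    total = 0.0
    comp = 0.0
    for x in xs:
        y = x - comp
        t = total + y
        comp = (t - total) - y
        total = t
    return total


# Very different magnitudes: the mathematically exact sum is 2.0.
mixed = [1e16, 1.0, -1e16, 1.0]
print(naive_sum(mixed))          # 1.0
print(lane_sum(mixed, lanes=2))  # 2.0
print(pairwise_sum(mixed))       # 0.0
print(math.fsum(mixed))          # 2.0

# Many tiny terms that a naive loop drops entirely.
small = [1.0] + [1e-16] * 1000
print(naive_sum(small))                               # 1.0
print(abs(kahan_sum(small) - math.fsum(small)) < 1e-15)  # True
```

The same four numbers give three different answers depending only on association order, which is why one side wants the order pinned down. Note that no reordering is "the accurate one" in general: here the two-lane order happens to be exact, and pairwise happens to be the worst.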
Performance and software bloat
- Commenters marvel at the nanosecond-scale loop times relative to 1 MHz-era CPUs, but contrast this with how “slow” modern desktop software feels.
- There’s a recurring theme that developer productivity stacks (Electron, heavy frameworks) trade away large amounts of performance; some see this as rational (time-to-market), others as user-hostile.
- Progress indicators are seen as masking, not fixing, unnecessary latency in modern UIs.
Language and assembly style notes
- Some dislike dense C idioms with side effects (e.g., *p++), praising languages that forbid them for clarity.
- There’s discussion of addressing modes: AArch64’s post-indexing (ldr s1, [x0], #4) vs x86 string ops (lods/stos/movs), and how they express “load and increment” patterns.
- Minor nits include unit switching (µs vs ns) and a mid-article switch from C to Rust, which some found visually confusing but not conceptually important.