Don't "optimize" conditional moves in shaders with mix()+step()
What “branching” means on GPUs
- Multiple commenters note a key distinction: a branch is a conditional jump that changes the program counter; a conditional move/select does not.
- On GPUs, threads in a warp/wavefront execute the same instruction stream. If threads disagree on a branch, the hardware usually runs both paths sequentially with masks, idling non‑taken lanes.
- If all lanes make the same decision (“uniform branch”), only one path runs and a real branch can be beneficial.
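The masked-execution behavior described above can be sketched on the CPU as a toy model. This is only an illustration, not how any particular GPU is implemented: the group width `kLanes`, the `WarpResult` type, and the function names are all hypothetical, and the two sides of the if/else are arbitrary arithmetic.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Toy model of one warp/wavefront. kLanes is an illustrative width;
// real hardware commonly uses 32 or 64 lanes.
constexpr std::size_t kLanes = 8;
using Mask = std::array<bool, kLanes>;
using Values = std::array<float, kLanes>;

struct WarpResult {
    Values out{};    // per-lane results
    int passes = 0;  // how many sides of the if/else were issued
};

static bool any_lane(const Mask& m) {
    for (bool b : m) {
        if (b) return true;
    }
    return false;
}

// Models `if (cond) out = x * 2; else out = x + 100;` for the whole group.
WarpResult run_if_else(const Values& x, const Mask& cond) {
    WarpResult r;
    Mask not_cond{};
    for (std::size_t i = 0; i < kLanes; ++i) not_cond[i] = !cond[i];

    if (any_lane(cond)) {  // "then" side is issued once for the group
        ++r.passes;
        for (std::size_t i = 0; i < kLanes; ++i) {
            if (cond[i]) r.out[i] = x[i] * 2.0f;  // masked-off lanes idle
        }
    }
    if (any_lane(not_cond)) {  // "else" side, skipped if no lane needs it
        ++r.passes;
        for (std::size_t i = 0; i < kLanes; ++i) {
            if (not_cond[i]) r.out[i] = x[i] + 100.0f;
        }
    }
    return r;
}
```

With a uniform condition the model issues one side (`passes == 1`); as soon as lanes disagree it issues both (`passes == 2`), and each lane keeps only the result from its own side.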
step()+mix() vs ternary/if
- The criticized pattern is using step()+mix() to “avoid branches” that weren’t there to begin with; the original ternary compiles to conditional moves/selects, not jumps. step() itself is typically implemented as a conditional, so you’re just hiding logic, not removing it, and often adding extra arithmetic.
- Some note that using mix() with a boolean/vector mask is fine when that’s the natural form, but it’s not an optimization over a ternary that already works.
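To make the comparison concrete, here is a minimal CPU-side sketch in C++. The helpers `glsl_step` and `glsl_mix` are hypothetical stand-ins that follow the documented definitions of the GLSL built-ins (step returns 0.0 when x < edge and 1.0 otherwise; float mix is a linear blend); `pick_select` and `pick_step_mix` are illustrative names, not code from the thread. It assumes IEEE 754 float arithmetic without fast-math flags.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// CPU stand-ins for the GLSL built-ins (assumed definitions, see lead-in).
float glsl_step(float edge, float x) { return x < edge ? 0.0f : 1.0f; }
float glsl_mix(float a, float b, float t) { return a * (1.0f - t) + b * t; }

// Plain form: one comparison, one select.
float pick_select(float x, float edge, float a, float b) {
    return x < edge ? a : b;
}

// "Branchless" form: the comparison is still there (inside step), plus a
// subtract, two multiplies, and an add.
float pick_step_mix(float x, float edge, float a, float b) {
    return glsl_mix(a, b, glsl_step(edge, x));
}
```

For finite inputs both forms return the same value. One additional observation under the IEEE assumption: the arithmetic form touches the unselected operand, so `pick_step_mix(0.8f, 0.5f, INFINITY, 3.0f)` yields NaN (infinity times zero), while `pick_select` with the same arguments yields 3. Real shader compilers and GPUs may treat such cases differently.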
Performance tradeoffs and when branches hurt
- Divergent branches reduce effective throughput because portions of a warp do useless work; uniform branches can skip work and be faster.
- For short, cheap expressions, computing both sides and selecting is often best; for very asymmetric or expensive branches, a real branch can win.
- Several people emphasize that you can’t reliably reason this out in your head: profile on target hardware.
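The tradeoff in this section can be illustrated with a deliberately crude cost model. Everything here is an assumption made for illustration: the `Costs` struct, the unit-less cost numbers, and the fixed `branch_overhead` are hypothetical and do not correspond to any real GPU. It is no substitute for profiling.

```cpp
#include <cassert>
#include <vector>

// Hypothetical per-group costs in arbitrary units.
struct Costs {
    int then_cost;
    int else_cost;
    int branch_overhead;
};

// Compute both sides and select: the group always pays for both.
int cost_select(const Costs& c) { return c.then_cost + c.else_cost; }

// Real branch: pay the overhead, plus each side that at least one lane takes.
int cost_branch(const Costs& c, const std::vector<bool>& lane_takes_then) {
    assert(!lane_takes_then.empty());
    bool any_then = false;
    bool any_else = false;
    for (bool takes_then : lane_takes_then) {
        if (takes_then) {
            any_then = true;
        } else {
            any_else = true;
        }
    }
    return c.branch_overhead + (any_then ? c.then_cost : 0) +
           (any_else ? c.else_cost : 0);
}
```

Under these made-up numbers, `Costs{2, 2, 4}` favors the select (4 units versus 6 for a uniform branch and 8 for a divergent one), while `Costs{200, 2, 4}` with every lane taking the cheap side favors the branch (6 versus 202). A divergent group still pays for both sides plus the overhead (206).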
Compiler behavior and tooling
- Whether step/mix gets optimized back into a conditional move is compiler‑ and driver‑dependent; shader compilers are sensitive to compile latency and can’t afford to run every heavy optimization.
- There’s debate about adding passes to detect and undo the “fake optimization”; some say it’s straightforward pattern‑matching, others expect many variants and corner cases.
- Multiple tools are mentioned (DXIL, SPIR‑V, vendor ISAs, Radeon GPU Analyzer, driver disassembly) and people advocate inspecting generated code to see real branches, masking, and unrolling.
CPU conditional moves tangent
- A large subthread discusses cmov on CPUs: sometimes faster than unpredictable branches, sometimes worse due to data dependencies and good branch predictors.
- People complain about not being able to force a cmov in C/C++; compilers use heuristics, sometimes undo cmovs, and there are flags and intrinsics to influence this with mixed success.
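As a small illustration of the C/C++ point, here are two common ways to write a select. The function names are hypothetical. Neither form guarantees a cmov: the compiler is free to emit a branch, a conditional move, or plain bitwise instructions for either one, depending on its heuristics and flags.

```cpp
#include <cassert>
#include <cstdint>

// Ternary form: often lowered to a conditional move, but the compiler
// may also choose a branch.
std::uint32_t select_ternary(bool c, std::uint32_t a, std::uint32_t b) {
    return c ? a : b;
}

// Mask form: builds an all-ones or all-zeros mask from the condition and
// blends with bitwise ops. The compiler may still rewrite it.
std::uint32_t select_mask(bool c, std::uint32_t a, std::uint32_t b) {
    const std::uint32_t mask = 0u - static_cast<std::uint32_t>(c);
    return (a & mask) | (b & ~mask);
}
```

Both functions return the same value for all inputs; the only way to know what was actually generated is to read the disassembly for the target compiler and optimization level.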
Driver and ecosystem quirks
- GPU vendors sometimes replace or tweak game shaders in drivers for performance or correctness, sometimes keyed by executable name or shader hashes.
- This can yield big speedups but also odd behaviors and compatibility issues when games or mods deviate from what drivers expect.
Misinformation, LLMs, and best practices
- Commenters note that the “branches are always bad, use step/mix instead” meme is old, platform‑specific, and wrong for modern GPUs, yet persists online.
- LLMs are criticized for repeating this folklore, since they mirror common but incorrect advice.
- General guidance from the thread: write clear code (e.g., ternary/if), inspect generated code when in doubt, and measure on representative GPUs rather than relying on myths.