2025-02-09

Don't "optimize" conditional moves in shaders with mix()+step()

What “branching” means on GPUs

Multiple commenters note a key distinction: a branch is a conditional jump that changes the program counter; a conditional move/select does not.
On GPUs, the threads in a warp/wavefront execute in lockstep and share one instruction stream. If threads disagree on a branch, the hardware usually runs both paths one after the other under an execution mask, leaving the lanes that did not take the current path idle.
If all lanes make the same decision (“uniform branch”), only one path runs and a real branch can be beneficial.
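
The masking behavior described above can be sketched on the CPU. This is a toy model, not how any particular GPU is implemented: `run_warp`, its parameters, and the 8-lane warp used below are hypothetical names and values chosen for illustration.

```python
def run_warp(xs, cond, then_fn, else_fn):
    """Toy model of one warp executing `if cond(x): then_fn(x) else: else_fn(x)`.

    Every lane steps through each path that at least one lane takes;
    lanes whose mask bit does not match the current path sit idle.
    """
    mask = [cond(x) for x in xs]   # per-lane branch decision
    out = list(xs)
    passes = 0                     # how many paths the warp had to walk
    idle_slots = 0                 # lane-slots spent masked off

    if any(mask):                  # "then" path runs if any lane wants it
        passes += 1
        for i, x in enumerate(xs):
            if mask[i]:
                out[i] = then_fn(x)
            else:
                idle_slots += 1

    if not all(mask):              # "else" path runs if any lane wants it
        passes += 1
        for i, x in enumerate(xs):
            if not mask[i]:
                out[i] = else_fn(x)
            else:
                idle_slots += 1

    return out, passes, idle_slots


lanes = list(range(8))  # hypothetical 8-lane warp, for readability
divergent = run_warp(lanes, lambda x: x % 2 == 0, lambda x: x * 10, lambda x: -x)
uniform = run_warp(lanes, lambda x: x >= 0, lambda x: x * 10, lambda x: -x)
```

In this model the divergent case walks both paths with half the lanes idle on each, while the uniform case walks a single path with no idle lanes.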

step()+mix() vs ternary/if

The criticized pattern is using step() + mix() to “avoid branches” that weren’t there to begin with; the original ternary compiles to conditional moves/selects, not jumps.
step() itself is typically implemented as a conditional, so you’re just hiding logic, not removing it, and often adding extra arithmetic.
Some note that using mix() with a boolean/vector mask is fine when that’s the natural form, but it’s not an optimization over a ternary that already works.
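
To make the comparison concrete, here is a minimal CPU-side sketch that models `step()` and `mix()` with the formulas the GLSL specification gives for them. It is Python rather than shader code, the helper names `select_ternary` and `select_step_mix` are hypothetical, and it says nothing about what a given compiler actually emits.

```python
import math


def step(edge: float, x: float) -> float:
    """Model of GLSL step(): 0.0 if x < edge, else 1.0. Note the comparison."""
    return 0.0 if x < edge else 1.0


def mix(a: float, b: float, t: float) -> float:
    """Model of GLSL mix(): linear blend a*(1-t) + b*t."""
    return a * (1.0 - t) + b * t


def select_ternary(x: float, edge: float, lo: float, hi: float) -> float:
    # What `x < edge ? lo : hi` expresses: one compare and one select.
    return lo if x < edge else hi


def select_step_mix(x: float, edge: float, lo: float, hi: float) -> float:
    # The "branchless" rewrite: the same compare (inside step) plus
    # a subtract, two multiplies, and an add (inside mix).
    return mix(lo, hi, step(edge, x))


same = select_ternary(0.25, 0.5, 10.0, 20.0) == select_step_mix(0.25, 0.5, 10.0, 20.0)

# With a non-finite operand the arithmetic form can differ: inf * 0.0 is NaN.
ternary_result = select_ternary(0.0, 1.0, 2.0, math.inf)
blend_result = select_step_mix(0.0, 1.0, 2.0, math.inf)
```

For finite inputs the two forms agree in this model, but the rewrite keeps the comparison and adds arithmetic on top of it. Under IEEE floating-point rules the blended form also yields NaN when the unselected operand is infinite, whereas the ternary simply returns the selected value.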

Performance tradeoffs and when branches hurt

Divergent branches reduce effective throughput because the warp steps through both paths while some lanes sit masked off, doing no useful work; uniform branches can skip a path entirely and be faster.
For short, cheap expressions, computing both sides and selecting is often best; for very asymmetric or expensive branches, a real branch can win.
Several people emphasize that you can’t reliably reason this out in your head; profile on target hardware.
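
The tradeoff can be illustrated with a deliberately crude cost model. Everything numeric here is an assumption made up for the example: the cost units, the `branch_overhead` and `select_overhead` defaults, and the 32-lane warp. Real hardware differs, which is exactly why the thread recommends profiling.

```python
def branch_cost(mask, cost_then, cost_else, branch_overhead=4):
    """Toy cost of a real branch over one warp.

    The warp pays for each path that at least one lane takes,
    plus a fixed (assumed) overhead for the branch itself.
    """
    cost = branch_overhead
    if any(mask):
        cost += cost_then
    if not all(mask):
        cost += cost_else
    return cost


def select_cost(cost_then, cost_else, select_overhead=1):
    """Toy cost of computing both sides and selecting: always pays for both."""
    return cost_then + cost_else + select_overhead


WARP = 32  # assumed warp width
uniform_false = [False] * WARP
divergent = [i % 2 == 0 for i in range(WARP)]

# Short, cheap sides: select beats even a uniform branch in this model.
cheap = (branch_cost(uniform_false, 1, 1), select_cost(1, 1))

# Very asymmetric sides: a uniform branch skips the expensive path.
asymmetric_uniform = (branch_cost(uniform_false, 100, 2), select_cost(100, 2))

# Same asymmetric sides, divergent warp: the branch pays for both paths anyway.
asymmetric_divergent = (branch_cost(divergent, 100, 2), select_cost(100, 2))
```

Each tuple is `(branch, select)`. The model only captures the shape of the argument: which form wins depends on how uniform the condition is and how lopsided the two sides are.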

Compiler behavior and tooling

Whether step/mix gets optimized back into a conditional move is compiler‑ and driver‑dependent; shader compilers often run at load time or at run time, so compile latency matters and they can’t afford every heavy optimization.
There’s debate about adding passes to detect and undo the “fake optimization”; some say it’s straightforward pattern‑matching, others expect many variants and corner cases.
Several ways of looking at compiler output are mentioned: intermediate representations (DXIL, SPIR‑V), vendor ISA listings, Radeon GPU Analyzer, and driver disassembly. People advocate inspecting the generated code to see where real branches, masking, and unrolling actually occur.

CPU conditional moves tangent

A large subthread discusses cmov on CPUs: sometimes faster than unpredictable branches, sometimes worse due to data dependencies and good branch predictors.
People complain about not being able to force a cmov in C/C++; compilers use heuristics, sometimes undo cmovs, and there are flags and intrinsics to influence this with mixed success.

Driver and ecosystem quirks

GPU vendors sometimes replace or tweak game shaders in drivers for performance or correctness, sometimes keyed by executable name or shader hashes.
This can yield big speedups but also odd behaviors and compatibility issues when games or mods deviate from what drivers expect.

Misinformation, LLMs, and best practices

Commenters note that the “branches are always bad, use step/mix instead” meme is old, platform‑specific, and wrong for modern GPUs, yet persists online.
LLMs are criticized for repeating this folklore, since they mirror common but incorrect advice.
General guidance from the thread: write clear code (e.g., ternary/if), inspect generated code when in doubt, and measure on representative GPUs rather than relying on myths.
