Don't "optimize" conditional moves in shaders with mix()+step()

What “branching” means on GPUs

  • Multiple commenters draw a key distinction: a branch is a conditional jump that changes the program counter, while a conditional move/select picks one of two already-computed values without changing control flow.
  • On GPUs, threads in a warp/wavefront execute the same instruction stream. If threads disagree on a branch, the hardware usually runs both paths sequentially with masks, idling non‑taken lanes.
  • If all lanes make the same decision (“uniform branch”), only one path runs and a real branch can be beneficial.
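
A toy Python emulation of the masked execution described above may help. It assumes a simplified SIMT model (a single warp, lanes as list entries, no reconvergence stack or vendor-specific details). `run_warp` and its parameters are hypothetical names, not any real API.

```python
from typing import Callable, List, Tuple

def run_warp(xs: List[float],
             cond: Callable[[float], bool],
             then_fn: Callable[[float], float],
             else_fn: Callable[[float], float]) -> Tuple[List[float], List[str]]:
    """Emulate one warp executing: if cond(x): then_fn(x) else: else_fn(x).

    All lanes share one instruction stream. A path is issued for the whole
    warp if at least one lane takes it; lanes that don't take it are masked
    off and do not write a result.
    """
    mask = [cond(x) for x in xs]      # per-lane predicate
    out = [0.0] * len(xs)
    issued: List[str] = []            # which paths the warp actually executed

    if any(mask):                     # some lane needs the "then" path
        issued.append("then")
        for i, x in enumerate(xs):
            if mask[i]:               # only active lanes write
                out[i] = then_fn(x)

    if not all(mask):                 # some lane needs the "else" path
        issued.append("else")
        for i, x in enumerate(xs):
            if not mask[i]:
                out[i] = else_fn(x)

    return out, issued
```

With all lanes positive and `cond = lambda x: x > 0`, the warp issues only the `then` path (a uniform branch). Mixing signs forces it to issue both paths, with each lane keeping only its own result.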

step()+mix() vs ternary/if

  • The criticized pattern uses step() + mix() to “avoid branches” that weren’t there to begin with: the original ternary typically compiles to conditional moves/selects, not jumps.
  • step() is itself typically implemented as a comparison plus select, so the pattern hides the conditional rather than removing it, and often adds extra arithmetic on top.
  • Some note that using mix() with a boolean/vector mask is fine when that’s the natural form, but it’s not an optimization over a ternary that already works.
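
To make the comparison concrete, here is a minimal Python sketch of the GLSL semantics of step() (0.0 when x < edge, 1.0 otherwise) and mix() (a * (1 - t) + b * t) next to the plain ternary. Python floats stand in for shader floats, and `select_ternary` / `select_step_mix` are hypothetical names.

```python
def step(edge: float, x: float) -> float:
    # GLSL step(): itself just a comparison producing 0.0 or 1.0
    return 0.0 if x < edge else 1.0

def mix(a: float, b: float, t: float) -> float:
    # GLSL mix(): linear interpolation a*(1-t) + b*t
    return a * (1.0 - t) + b * t

def select_ternary(a: float, b: float, edge: float, x: float) -> float:
    # What the original ternary expresses: (x < edge) ? a : b
    return a if x < edge else b

def select_step_mix(a: float, b: float, edge: float, x: float) -> float:
    # The "branchless" rewrite: the same comparison, plus two multiplies and an add
    return mix(a, b, step(edge, x))
```

For finite inputs both forms agree, but the step/mix version still performs the comparison and then does extra arithmetic. With an infinite input they can even differ, because inf * 0.0 is NaN; that is one reason a compiler may not fold the arithmetic form back into a select unless relaxed floating-point rules allow it.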

Performance tradeoffs and when branches hurt

  • Divergent branches reduce effective throughput because portions of a warp do useless work; uniform branches can skip work and be faster.
  • For short, cheap expressions, computing both sides and selecting is often best; for very asymmetric or expensive branches, a real branch can win.
  • Several people emphasize that you can’t reliably reason this out in your head; profile on the target hardware.
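
The tradeoff can be sketched with a toy per-warp cost model in Python. The unit costs and the overhead constants (`branch_overhead=2`, `select_overhead=1`) are invented for illustration and say nothing about any real GPU, which is exactly why the thread recommends profiling. The function names are hypothetical.

```python
from typing import List

def branch_cost(mask: List[bool], cost_then: int, cost_else: int,
                branch_overhead: int = 2) -> int:
    """Issue cost of a real branch for one warp (toy model, made-up units)."""
    cost = branch_overhead
    if any(mask):
        cost += cost_then   # issued if any lane takes "then"
    if not all(mask):
        cost += cost_else   # issued if any lane takes "else"
    return cost

def select_cost(cost_then: int, cost_else: int, select_overhead: int = 1) -> int:
    """Issue cost of computing both sides and selecting; independent of the mask."""
    return cost_then + cost_else + select_overhead
```

With cheap sides (1 unit each), select costs 3 regardless of the mask, while the branch costs 3 when uniform and 4 when divergent. With an expensive `then` side (200 units), a uniform warp that skips it pays 3 with a branch against 202 with select, while a divergent warp pays for both either way.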

Compiler behavior and tooling

  • Whether step/mix gets optimized back into a conditional move is compiler- and driver-dependent; shader compilers are sensitive to compile time (they often run inside the driver while the application is loading or running) and can’t afford every heavy optimization.
  • There’s debate about adding passes to detect and undo the “fake optimization”; some say it’s straightforward pattern‑matching, others expect many variants and corner cases.
  • Several intermediate representations and tools come up (DXIL, SPIR-V, vendor ISAs, Radeon GPU Analyzer, driver disassembly), and people advocate inspecting the generated code to see real branches, masking, and unrolling.
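
As a rough picture of what the pattern-matching side of that debate means, here is a toy Python peephole rewrite over expression tuples. The IR (`mix`, `step`, `select`, `lt` as tuple tags) is hypothetical and not any real compiler's. It handles exactly one shape, which illustrates the objection that real shaders contain many variants, and it is only exact when `a` and `b` are finite.

```python
from typing import Union

# Hypothetical IR: a leaf (variable name or constant) or a tuple (op, *args).
Expr = Union[str, float, tuple]

def undo_step_mix(e: Expr) -> Expr:
    """Rewrite mix(a, b, step(edge, x)) into select(lt(x, edge), a, b), bottom-up.

    select(c, a, b) yields a when c is true. Only this exact shape is matched;
    variants such as mix(a, b, 1.0 - step(...)) or a step() stored in a
    temporary would each need their own pattern. Assumes a and b are finite.
    """
    if not isinstance(e, tuple):
        return e
    op, *args = e
    args = [undo_step_mix(a) for a in args]   # rewrite children first
    t = args[2] if op == "mix" and len(args) == 3 else None
    if isinstance(t, tuple) and len(t) == 3 and t[0] == "step":
        a, b = args[0], args[1]
        _, edge, x = t
        return ("select", ("lt", x, edge), a, b)
    return (op, *args)
```

For example, `("add", ("mix", "a", "b", ("step", "edge", "x")), 1.0)` becomes `("add", ("select", ("lt", "x", "edge"), "a", "b"), 1.0)`, while a `1.0 - step(...)` variant passes through unchanged.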

CPU conditional moves tangent

  • A large subthread discusses cmov on CPUs: sometimes faster than unpredictable branches, sometimes worse due to data dependencies and good branch predictors.
  • People complain about not being able to force a cmov in C/C++; compilers use heuristics, sometimes undo cmovs, and there are flags and intrinsics to influence this with mixed success.
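
A common hand-written alternative to relying on the compiler's cmov heuristics is a bit-mask select. The Python sketch below only shows the dataflow of that idiom: the result always depends on `a`, `b`, and the condition, which is the data dependency commenters mention. It says nothing about which instructions a C compiler would emit for it, and `select_mask` is a hypothetical name.

```python
def select_branchy(c: bool, a: int, b: int) -> int:
    # Control dependency: only the chosen operand is needed
    return a if c else b

def select_mask(c: bool, a: int, b: int) -> int:
    # Data dependency: both operands and the condition feed the result
    mask = -int(c)                 # -1 (all bits set) if c, else 0
    return (a & mask) | (b & ~mask)
```

Both functions return the same value for any integers; the difference that matters on a CPU is only whether the choice is made by control flow or by arithmetic on both inputs.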

Driver and ecosystem quirks

  • GPU vendors sometimes replace or tweak game shaders in drivers for performance or correctness, sometimes keyed by executable name or shader hashes.
  • This can yield big speedups but also odd behaviors and compatibility issues when games or mods deviate from what drivers expect.
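
The keying mechanism can be pictured as a lookup table. This Python sketch assumes a hypothetical table keyed by (executable name, SHA-256 of the shader source); real drivers' formats, hash functions, and matching rules will differ. It shows why a renamed executable or a modded shader silently falls back to the original code.

```python
import hashlib
from typing import Dict, Tuple

Key = Tuple[str, str]  # (executable name, SHA-256 hex digest of original shader source)

def shader_key(exe_name: str, source: str) -> Key:
    return (exe_name, hashlib.sha256(source.encode("utf-8")).hexdigest())

def pick_shader(table: Dict[Key, str], exe_name: str, source: str) -> str:
    # Swap in the replacement only on an exact (exe, hash) match;
    # any edit to the shader or a renamed executable gets the original.
    return table.get(shader_key(exe_name, source), source)
```
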

Misinformation, LLMs, and best practices

  • Commenters note that the “branches are always bad, use step/mix instead” meme is old, platform‑specific, and wrong for modern GPUs, yet persists online.
  • LLMs are criticized for repeating this folklore, since they mirror common but incorrect advice.
  • General guidance from the thread: write clear code (e.g., ternary/if), inspect generated code when in doubt, and measure on representative GPUs rather than relying on myths.