2024-07-26

Zen 5's 2-ahead branch predictor: how a 30 year old idea allows for new tricks

Understanding Zen 5’s 2‑ahead branch predictor

Core idea: conventional predictors guess the next basic block; 2‑ahead prediction tries to predict the block after the next one using information from the current block.
This lets the frontend fetch and decode two future blocks in parallel, helping to keep multiple decoders and a wide pipeline busy.
It’s especially helpful for ISAs with variable-length instructions (x86, possibly RISC‑V), where instruction boundaries are not known in advance, so decoding a later block cannot start until its starting PC is known.
Some commenters stress that modern OoO cores already speculate across many branches; the innovation is in how fetch/decode is organized and pipelined, not “speculating only two ahead.”
Several readers still find the article unclear on the precise hardware mechanism, and the thread does not settle those details.
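
As a rough illustration of the core idea only (not AMD's actual mechanism, which the thread leaves unresolved), here is a toy Python model. It assumes a simple table keyed by the current block's address that remembers the last observed pair of following blocks. The class name, the block labels, and the lookups_needed helper are all hypothetical.

```python
class TwoAheadPredictor:
    """Toy table: current block -> (next block, block after next).

    Hypothetical model for illustration; real predictors use history-based
    indexing, tags, and confidence counters that are not modeled here.
    """

    def __init__(self):
        self.table = {}

    def train(self, trace):
        # Record, for each block, the two blocks that followed it most recently.
        for current, nxt, after in zip(trace, trace[1:], trace[2:]):
            self.table[current] = (nxt, after)

    def predict(self, block):
        # One lookup yields two fetch targets, or None if the block is unseen.
        return self.table.get(block)


def lookups_needed(num_blocks, blocks_per_lookup):
    """Serial predictor lookups needed to cover num_blocks future blocks."""
    return -(-num_blocks // blocks_per_lookup)  # ceiling division


# A loop over three basic blocks: A -> B -> C -> A -> ...
trace = ["A", "B", "C"] * 3
predictor = TwoAheadPredictor()
predictor.train(trace)

print(predictor.predict("A"))  # ('B', 'C')
print(lookups_needed(8, 1), lookups_needed(8, 2))  # 8 4
```

In this sketch a single lookup from block A gives both B and C, so two fetch/decode streams could start at once instead of waiting for a second, dependent lookup.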

Why not execute both sides of every branch?

Doubling work on every branch wastes energy and execution bandwidth when predictors are already ~99% accurate on many workloads.
For deeply speculative frontends, following both sides of every unresolved branch would grow the number of paths exponentially (2, 4, 8, 16, ...).
Existing speculative/out-of-order machinery already handles mispredictions efficiently; better prediction is usually cheaper than dual-path execution.
Some note that GPUs effectively execute both sides of divergent branches, and that this approach performs poorly on general scalar workloads.
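
The arithmetic behind these points can be sketched in a few lines of Python. This is a simplified model under stated assumptions: every branch has two outcomes, predictions are independent, and the accuracy is the ~99% figure quoted in the thread. The function names are hypothetical.

```python
def eager_paths(unresolved_branches):
    """Paths in flight if both sides of every unresolved branch are followed."""
    return 2 ** unresolved_branches


def eager_on_path_share(unresolved_branches):
    """Share of the deepest-level work that lies on the correct path."""
    return 1 / eager_paths(unresolved_branches)


def predicted_on_path_probability(accuracy, unresolved_branches):
    """Chance a single predicted path is still correct after N branches,
    assuming independent predictions (a simplification)."""
    return accuracy ** unresolved_branches


for depth in (1, 2, 4, 8):
    print(
        depth,
        eager_paths(depth),
        eager_on_path_share(depth),
        round(predicted_on_path_probability(0.99, depth), 4),
    )
```

Under these assumptions, at a depth of 4 unresolved branches the dual-path approach spends 15 of 16 paths on work that will be discarded, while a single predicted path is still correct about 96% of the time.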

SMT, pipeline utilization, and wide cores

Good SMT speedups often indicate underutilized resources (“pipeline bubbles”).
As cores get wider (more ALUs/AGUs, wider dispatch), a single thread rarely saturates them, so SMT and better branch prediction become more valuable.
Others argue that as OoO improves, SMT gains can shrink; Zen 5’s much wider core may reverse that and increase SMT benefits.
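
A back-of-the-envelope model of the utilization argument, in Python. It assumes an idealized core in which threads share issue slots with no contention for caches or other resources, so it gives an upper bound rather than a prediction. The smt_speedup function and its parameters are hypothetical.

```python
def smt_speedup(single_thread_utilization, threads=2):
    """Upper bound on throughput gain from SMT in an idealized core.

    single_thread_utilization: fraction of issue slots one thread fills (0-1].
    Assumes perfectly complementary threads and no shared-resource contention.
    """
    u = single_thread_utilization
    if not 0 < u <= 1:
        raise ValueError("utilization must be in (0, 1]")
    # Combined threads cannot fill more than 100% of the slots.
    return min(1.0, threads * u) / u


for utilization in (0.5, 0.8, 1.0):
    print(utilization, smt_speedup(utilization))
```

In this model a thread that fills half the issue slots leaves room for up to a 2x gain, while one that fills 80% leaves room for at most 1.25x. This matches both sides of the discussion: better OoO raises single-thread utilization and shrinks the gain, and a wider core lowers it and enlarges the gain.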

Security and speculative execution

Branch prediction itself isn’t inherently the vulnerability; attacks exploit speculative execution’s interaction with caches, TLBs, and timing.
Speculation is considered too valuable to remove; mitigations focus on isolation, memory model behavior, and removing fine-grained timers in some environments.

Old ideas becoming practical

2‑ahead prediction is based on 1990s “multiple-block ahead” research that has become viable given modern tradeoffs.
The thread draws parallels to Z‑buffers, ray tracing, EEVDF scheduling, LDPC codes, PEG parsing, modern GC, and Rust’s type system as older ideas that became mainstream once hardware or the ecosystem caught up.

Cores, memory bandwidth, and workloads

Some see massive core counts (Zen 5c/6c) and advanced prediction as making “kilo-core” scale on a single box viable for many web workloads.
Others note real bottlenecks often lie in databases, memory bandwidth, or network I/O, not just raw core or branch predictor capability.
