Performance of the Python 3.14 tail-call interpreter
Revised performance story and LLVM regression
- The tail-call interpreter in Python 3.14 is a real speedup, but much smaller than the originally publicized 10–15%; on non-buggy compilers it’s more like 1–5%.
- The big apparent win came from an LLVM 19 optimization regression that made the old computed-goto interpreter slower, so the new interpreter looked much better by comparison.
- CPython’s official Linux builds use GCC, which is why the LLVM regression went unnoticed; the new interpreter also depends on a clang-19-only feature (`preserve_none`), so the bug and the new design landed together.
- LLVM has since merged a fix, but it’s heuristic-based; there’s no hard guarantee similar issues won’t recur.
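For readers who have not seen the older dispatch style, here is a minimal sketch of computed-goto dispatch. The three-opcode bytecode and the name `run_goto` are invented for illustration and are not CPython's; `&&label` and `goto *` are GNU C extensions accepted by both GCC and Clang.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy bytecode, invented for this sketch (not CPython's). */
enum { OP_INC, OP_DOUBLE, OP_HALT };

/* Computed-goto dispatch: every handler ends with its own indirect jump
   to the next handler instead of returning to a central switch. */
long run_goto(const unsigned char *code)
{
    /* Table of label addresses, indexed by opcode (GNU C extension). */
    static void *const targets[] = {
        [OP_INC] = &&do_inc,
        [OP_DOUBLE] = &&do_double,
        [OP_HALT] = &&do_halt,
    };
    long acc = 0;
    size_t pc = 0;

    assert(code != NULL);

#define DISPATCH() goto *targets[code[pc++]]

    DISPATCH();

do_inc:
    acc += 1;
    DISPATCH();
do_double:
    acc *= 2;
    DISPATCH();
do_halt:
    return acc;

#undef DISPATCH
}
```

Because all handlers live in one large function, how well this performs depends on how the compiler lays out and duplicates those indirect jumps, which is the kind of decision the summary says regressed.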
Why keep the tail-call interpreter?
- Even at 1–5%, a global runtime speedup is considered significant for a mature VM.
- The change is generated from a DSL, so source complexity stays manageable; the main cost is in autogenerated code.
- Tail-call style plus attributes (`musttail`, `noinline`, `preserve_none`) gives maintainers more control over control flow and stack behavior, making performance more robust to compiler heuristics and LTO/PGO variance.
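A rough sketch of the tail-call style follows, under stated assumptions: the toy bytecode, the handler names, and `run_tail` are hypothetical and not CPython's, and the `MUSTTAIL` macro falls back to an ordinary call on compilers without the attribute. `preserve_none` and `noinline` are left out to keep the example portable.

```c
#include <assert.h>
#include <stddef.h>

/* Use the musttail statement attribute where the compiler offers it;
   otherwise degrade to a plain call (which may consume stack). */
#ifndef __has_attribute
#define __has_attribute(x) 0
#endif
#if __has_attribute(musttail)
#define MUSTTAIL __attribute__((musttail))
#else
#define MUSTTAIL
#endif

/* Hypothetical toy bytecode, invented for this sketch (not CPython's). */
enum { OP_INC, OP_DOUBLE, OP_HALT };

/* Every handler shares one signature so each can tail-call any other. */
typedef long handler_fn(const unsigned char *pc, long acc);
static handler_fn op_inc, op_double, op_halt;

static handler_fn *const handlers[] = {
    [OP_INC] = op_inc,
    [OP_DOUBLE] = op_double,
    [OP_HALT] = op_halt,
};

/* Jump to the handler for the opcode at pc, passing state as arguments. */
#define DISPATCH(pc, acc) MUSTTAIL return handlers[*(pc)]((pc) + 1, (acc))

static long op_inc(const unsigned char *pc, long acc)
{
    DISPATCH(pc, acc + 1);
}

static long op_double(const unsigned char *pc, long acc)
{
    DISPATCH(pc, acc * 2);
}

static long op_halt(const unsigned char *pc, long acc)
{
    (void)pc; /* nothing left to decode */
    return acc;
}

/* Entry point: dispatch the first instruction with an empty accumulator. */
long run_tail(const unsigned char *code)
{
    assert(code != NULL);
    return handlers[*code](code + 1, 0);
}
```

Each opcode is now a small standalone function, so the compiler optimizes handlers one at a time, and the guaranteed tail call keeps the stack flat regardless of how many instructions run.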
Benchmarking is hard and fragile
- Commenters share experiences where code layout, alignment, and “linker lottery” produce double-digit percentage swings with no logical code change.
- Tools and techniques mentioned: randomizing layout (e.g. Stabilizer, linker padding options), causal profilers (Coz), and running across multiple CPUs/compilers with error bars.
- There’s criticism of ad hoc benchmarks on overloaded laptops; the CPython team points to a dedicated benchmarking suite and explains why they avoid constantly changing compiler versions there.
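As a hedged illustration of the "error bars" point, the sketch below checks whether two sets of timing samples differ by more than their combined noise. The function names and the threshold `k` are assumptions made for this example; it is a crude sanity check, not a rigorous statistical test, and it does nothing about layout effects, which need the randomization tools mentioned above.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n timing samples (requires n >= 1). */
double sample_mean(const double *xs, size_t n)
{
    assert(xs != NULL && n >= 1);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += xs[i];
    return sum / (double)n;
}

/* Bessel-corrected sample variance (requires n >= 2). */
double sample_variance(const double *xs, size_t n)
{
    assert(n >= 2);
    double m = sample_mean(xs, n);
    double ss = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = xs[i] - m;
        ss += d * d;
    }
    return ss / (double)(n - 1);
}

/* Returns 1 when the means differ by more than k combined standard errors.
   Squared quantities are compared so no square root is needed. */
int differs_beyond_noise(const double *a, size_t na,
                         const double *b, size_t nb, double k)
{
    double diff = sample_mean(a, na) - sample_mean(b, nb);
    double se2 = sample_variance(a, na) / (double)na
               + sample_variance(b, nb) / (double)nb;
    return diff * diff > k * k * se2;
}
```

With `k = 2.0`, a 1% difference between two noisy five-run samples is typically reported as indistinguishable, which is the situation the commenters warn about when small speedups are claimed from a handful of runs.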
C, compilers, and “portable assembly” debate
- Long subthread argues over whether C is “portable assembly” or “close to the metal.”
- Examples show compilers:
  - Eliminating or moving `a += 1` when results are provably unused or constant-foldable.
  - Exploiting undefined behavior (e.g. signed overflow) to delete checks or branches.
  - Autovectorizing and radically reshaping loops.
- Some see C as still much more transparent than C++; others argue that modern optimizers and UB make reasoning about emitted machine code increasingly unreliable, which is exactly the problem for tight interpreter loops.
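Two of the transformations mentioned in the thread can be illustrated with a few lines of C. The function names are made up for this sketch, and whether a given compiler actually performs each transformation depends on its version and flags; the comments describe what the language rules permit, not what any specific build emits.

```c
#include <assert.h>
#include <limits.h>

/* Written as an overflow check, but signed overflow in x + 1 is undefined
   behavior, so an optimizer may assume it never happens and compile this
   function as "return 0". */
int next_would_wrap_ub(int x)
{
    return x + 1 < x;
}

/* Well-defined version: compare against the limit before any arithmetic. */
int next_would_wrap(int x)
{
    return x == INT_MAX;
}

/* The increment feeds nothing observable, so a compiler is free to
   delete it; no add instruction needs to appear in the output. */
int dead_increment(int a)
{
    int scratch = a;
    scratch += 1; /* dead store: the result is never read */
    return a;
}
```

Comparing the output of `-O0` and `-O2` builds for functions like these is a quick way to see how far the emitted code can drift from a line-by-line reading of the source.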
Python version performance and project trajectory
- Several users report Python 3.12/3.13 being slower than 3.11 in loops and web workloads; a specific loop regression issue is referenced.
- The “faster CPython” effort is said to have already delivered ~1.6× over 3.10 (more with JIT), with an eventual 5× goal; progress is incremental and compounded over releases.