2025-03-10

Performance of the Python 3.14 tail-call interpreter

Revised performance story and LLVM regression

The tail-call interpreter in Python 3.14 is a real speedup, but much smaller than the originally publicized 10–15%; on non-buggy compilers it’s more like 1–5%.
The big apparent win came from an LLVM 19 optimization regression that made the old computed-goto interpreter slower, so the new interpreter looked much better by comparison.
CPython’s official Linux builds use GCC, which is why the LLVM regression went unnoticed; the new interpreter also depends on preserve_none, an attribute available only from Clang 19 onward, so the new design was first measured on the very compiler release that carried the regression.
LLVM has since merged a fix, but it’s heuristic-based; there’s no hard guarantee similar issues won’t recur.
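A small worked calculation shows how a slowed baseline inflates a comparison. The timings below are invented for illustration (they are not measurements from the thread), and the helper names are hypothetical.

```c
#include <stdio.h>

/* Percent by which `candidate` is faster than `baseline`,
   computed as (baseline_time / candidate_time - 1) * 100. */
double percent_faster(double baseline_time, double candidate_time)
{
    return (baseline_time / candidate_time - 1.0) * 100.0;
}

/* Compare one tail-call timing against a healthy and a regressed baseline.
   All three timings are made-up numbers, in arbitrary time units. */
void print_example(void)
{
    const double healthy_goto   = 100.0; /* assumed: computed goto, unaffected compiler */
    const double regressed_goto = 110.0; /* assumed: same source, regressed compiler */
    const double tail_call      = 97.0;  /* assumed: tail-call interpreter */

    printf("vs healthy baseline:   %.1f%% faster\n",
           percent_faster(healthy_goto, tail_call));
    printf("vs regressed baseline: %.1f%% faster\n",
           percent_faster(regressed_goto, tail_call));
}
```

With these assumed numbers the same tail-call build looks about 3.1% faster against the healthy baseline and about 13.4% faster against the regressed one, even though the new interpreter itself did not change.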

Why keep the tail-call interpreter?

Even at 1–5%, a global runtime speedup is considered significant for a mature VM.
The change is generated from a DSL, so source complexity stays manageable; the main cost is in autogenerated code.
Tail-call style plus attributes (musttail, noinline, preserve_none) gives maintainers more control over control flow and stack behavior, making performance more robust to compiler heuristics and LTO/PGO variance.
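The tail-call style can be sketched on a toy accumulator machine. This is a minimal illustration, not CPython's generated code: the opcodes, handler names, and run() are hypothetical, and the sketch uses only musttail and preserve_none. The macros expand to nothing on compilers that lack the attributes, in which case the tail calls are no longer guaranteed and the code only suits short programs.

```c
#include <stddef.h>
#include <stdint.h>

#ifndef __has_attribute
#  define __has_attribute(x) 0
#endif

#if __has_attribute(musttail)
#  define MUSTTAIL __attribute__((musttail))
#else
#  define MUSTTAIL /* no guarantee: the compiler may or may not emit a jump */
#endif

#if __has_attribute(preserve_none)
#  define PRESERVE_NONE __attribute__((preserve_none))
#else
#  define PRESERVE_NONE
#endif

/* Hypothetical toy instruction set. */
enum { OP_ADDI, OP_DOUBLE, OP_HALT };

/* Every handler has the same signature and calling convention,
   which is what a guaranteed tail call between them requires. */
PRESERVE_NONE typedef int64_t (*handler_fn)(const uint8_t *pc, int64_t acc);

PRESERVE_NONE static int64_t op_addi(const uint8_t *pc, int64_t acc);
PRESERVE_NONE static int64_t op_double(const uint8_t *pc, int64_t acc);
PRESERVE_NONE static int64_t op_halt(const uint8_t *pc, int64_t acc);

static const handler_fn dispatch_table[] = {
    [OP_ADDI]   = op_addi,
    [OP_DOUBLE] = op_double,
    [OP_HALT]   = op_halt,
};

/* Jump to the handler for the next opcode instead of returning to a loop. */
#define DISPATCH(pc, acc) \
    MUSTTAIL return dispatch_table[*(pc)]((pc), (acc))

/* acc += one-byte immediate that follows the opcode. */
PRESERVE_NONE static int64_t op_addi(const uint8_t *pc, int64_t acc)
{
    acc += pc[1];
    pc += 2;
    DISPATCH(pc, acc);
}

/* acc *= 2 */
PRESERVE_NONE static int64_t op_double(const uint8_t *pc, int64_t acc)
{
    acc *= 2;
    pc += 1;
    DISPATCH(pc, acc);
}

/* Stop and hand the accumulator back to run(). */
PRESERVE_NONE static int64_t op_halt(const uint8_t *pc, int64_t acc)
{
    (void)pc;
    return acc;
}

/* Entry point: an ordinary call into the first handler. */
int64_t run(const uint8_t *program)
{
    return dispatch_table[program[0]](program, 0);
}
```

Running the program { OP_ADDI, 20, OP_DOUBLE, OP_ADDI, 2, OP_HALT } through run() yields 42. Each opcode lives in its own small function, so the compiler optimizes handlers individually rather than reasoning about one very large function, which is the robustness argument made above.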

Benchmarking is hard and fragile

Commenters share experiences where code layout, alignment, and “linker lottery” produce double-digit percentage swings with no logical code change.
Tools and techniques mentioned: randomizing layout (e.g. Stabilizer, linker padding options), causal profilers (Coz), and running across multiple CPUs/compilers with error bars.
There’s criticism of ad hoc benchmarks on overloaded laptops; the CPython team points to a dedicated benchmarking suite and explains why they avoid constantly changing compiler versions there.
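As a rough sketch of reporting "error bars" rather than a single number, the helpers below summarize repeated timings by minimum, median, and maximum, and treat two builds as clearly different only when their ranges do not overlap. The non-overlap rule is a deliberately crude assumption chosen for illustration; it is not what the CPython benchmarking suite does, and all names are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    double min;
    double median;
    double max;
} run_summary;

/* qsort comparator for doubles, ascending. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Summarize n >= 1 timing samples. Sorts `samples` in place. */
run_summary summarize(double *samples, size_t n)
{
    run_summary s;
    qsort(samples, n, sizeof samples[0], cmp_double);
    s.min = samples[0];
    s.max = samples[n - 1];
    s.median = (n % 2 == 1)
        ? samples[n / 2]
        : (samples[n / 2 - 1] + samples[n / 2]) / 2.0;
    return s;
}

/* Nonzero only when every run of one build beat every run of the other. */
int clearly_different(run_summary a, run_summary b)
{
    return a.max < b.min || b.max < a.min;
}
```

Under this rule, two builds whose medians differ by 1% but whose ranges overlap are reported as indistinguishable, which is the kind of caution the layout and alignment anecdotes argue for.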

C, compilers, and “portable assembly” debate

A long subthread argues over whether C is “portable assembly” or “close to the metal.”
Examples show compilers:
- Eliminating or moving a += 1 when results are provably unused or constant-foldable.
- Exploiting undefined behavior (e.g. signed overflow) to delete checks or branches.
- Autovectorizing and radically reshaping loops.
Some see C as still much more transparent than C++; others argue that modern optimizers and UB make reasoning about the emitted machine code increasingly unreliable, which is exactly the problem for tight interpreter loops.
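The signed-overflow case from the list above can be sketched as follows. The function names are hypothetical. The naive check relies on x + 1 wrapping around, which is undefined for signed int, so an optimizer is allowed to assume it never happens and reduce the function to a constant 0.

```c
#include <limits.h>

/* Naive check: only "works" if x + 1 wraps, which is undefined behavior
   for signed int. A compiler may legally fold this to `return 0;`. */
int next_would_overflow_naive(int x)
{
    return x + 1 < x;
}

/* Well-defined check: compare against the limit before doing arithmetic. */
int next_would_overflow(int x)
{
    return x == INT_MAX;
}
```

Both functions agree for every input except INT_MAX, where the naive version has no defined result at all. Whether a given compiler actually deletes the check depends on the compiler, its version, and the optimization level.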

Python version performance and project trajectory

Several users report Python 3.12/3.13 being slower than 3.11 in loops and web workloads; a specific loop regression issue is referenced.
The “faster CPython” effort is said to have already delivered ~1.6× over 3.10 (more with JIT), with an eventual 5× goal; progress is incremental and compounded over releases.
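To make the compounding concrete, the sketch below counts how many further releases it would take to get from ~1.6× to the 5× goal at a constant per-release gain. The per-release rates are hypothetical inputs, not figures from the thread, and the function name is an assumption.

```c
/* Number of releases needed for `current` to reach `goal` when each
   release multiplies performance by `per_release`.
   Returns -1 if per_release <= 1.0, since the goal is then unreachable
   (unless it has already been met, in which case 0 is returned). */
int releases_to_reach(double current, double goal, double per_release)
{
    int n = 0;

    if (current >= goal) {
        return 0;
    }
    if (per_release <= 1.0) {
        return -1;
    }
    while (current < goal) {
        current *= per_release;
        n++;
    }
    return n;
}
```

For example, releases_to_reach(1.6, 5.0, 1.10) gives 12 and releases_to_reach(1.6, 5.0, 1.25) gives 6, so modest per-release gains only add up to the goal over many releases.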
