Performance of the Python 3.14 tail-call interpreter

Revised performance story and LLVM regression

  • The tail-call interpreter in Python 3.14 is a real speedup, but much smaller than the originally publicized 10–15%; on non-buggy compilers it’s more like 1–5%.
  • The big apparent win came from an LLVM 19 optimization regression that made the old computed-goto interpreter slower, so the new interpreter looked much better by comparison.
  • CPython’s official Linux builds use GCC, which is why the LLVM regression went unnoticed. The new interpreter also depends on preserve_none, a calling-convention attribute first available in Clang 19, so the bug and the new design landed together.
  • LLVM has since merged a fix, but it’s heuristic-based; there’s no hard guarantee similar issues won’t recur.
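
For readers unfamiliar with the older design, a computed-goto interpreter loop looks roughly like the sketch below. It is a minimal illustration, not CPython's code: the three-opcode instruction set and the name run_computed_goto are hypothetical, and the `&&label` / `goto *` syntax is a GNU C extension accepted by GCC and Clang rather than standard C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy opcodes, for illustration only. */
enum { OP_INC, OP_DOUBLE, OP_HALT };

/* Runs bytecode until OP_HALT and returns the accumulator.
 * The code must end with OP_HALT; there is no bounds checking. */
int64_t run_computed_goto(const uint8_t *pc) {
    /* Table of label addresses indexed by opcode (GNU C extension). */
    static void *const targets[] = { &&do_inc, &&do_double, &&do_halt };
    int64_t acc = 0;

    assert(pc != NULL);

/* Every handler ends with its own indirect jump to the next handler. */
#define DISPATCH() goto *targets[*pc]
    DISPATCH();

do_inc:
    acc += 1;
    pc++;
    DISPATCH();
do_double:
    acc *= 2;
    pc++;
    DISPATCH();
do_halt:
    return acc;
#undef DISPATCH
}
```

All handlers live inside one large function, so how well this compiles depends on how the optimizer treats that function's many indirect jumps. That is the kind of code a compiler regression like the one described above can affect.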

Why keep the tail-call interpreter?

  • Even at 1–5%, a global runtime speedup is considered significant for a mature VM.
  • The interpreter is generated from a DSL, so the hand-written source stays manageable; most of the added complexity lives in autogenerated code.
  • Tail-call style plus attributes (musttail, noinline, preserve_none) gives maintainers more control over control flow and stack behavior, making performance more robust to compiler heuristics and LTO/PGO variance.
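
The tail-call style can be sketched in C as follows. This is a minimal illustration under stated assumptions, not CPython's generated code: the toy opcodes, the VM struct, and every function name are hypothetical. The MUSTTAIL shim enables the attribute only on Clang builds that report support for it and expands to nothing elsewhere, in which case the calls are ordinary calls that the optimizer may or may not turn into jumps. preserve_none and noinline are left out to keep the sketch portable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#ifndef __has_attribute
#  define __has_attribute(x) 0
#endif
/* Assumption: only use musttail on Clang, where it is known to exist. */
#if defined(__clang__) && __has_attribute(musttail)
#  define MUSTTAIL __attribute__((musttail))
#else
#  define MUSTTAIL
#endif

/* Hypothetical toy opcodes, for illustration only. */
enum { OP_INC, OP_DOUBLE, OP_HALT, OP_COUNT };

typedef struct VM { int64_t acc; } VM;

/* Every handler shares one signature so tail calls between them are legal. */
typedef int64_t (*Handler)(VM *vm, const uint8_t *pc);

static int64_t op_inc(VM *vm, const uint8_t *pc);
static int64_t op_double(VM *vm, const uint8_t *pc);
static int64_t op_halt(VM *vm, const uint8_t *pc);

static const Handler dispatch_table[OP_COUNT] = {
    [OP_INC] = op_inc, [OP_DOUBLE] = op_double, [OP_HALT] = op_halt,
};

/* Jump to the handler for the opcode at pc without growing the stack. */
#define DISPATCH(vm, pc) MUSTTAIL return dispatch_table[*(pc)]((vm), (pc))

static int64_t op_inc(VM *vm, const uint8_t *pc) {
    vm->acc += 1;
    pc += 1;
    DISPATCH(vm, pc);
}

static int64_t op_double(VM *vm, const uint8_t *pc) {
    vm->acc *= 2;
    pc += 1;
    DISPATCH(vm, pc);
}

static int64_t op_halt(VM *vm, const uint8_t *pc) {
    (void)pc;
    return vm->acc;
}

/* Entry point: the code must end with OP_HALT; no bounds checking. */
int64_t run_program(const uint8_t *code) {
    VM vm = { 0 };
    assert(code != NULL);
    return dispatch_table[code[0]](&vm, code);
}
```

Each opcode is now a small function of its own, so the compiler optimizes handlers individually instead of reasoning about one giant function, and musttail turns "this call must become a jump" into a hard requirement rather than a heuristic.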

Benchmarking is hard and fragile

  • Commenters share experiences where code layout, alignment, and “linker lottery” produce double-digit percentage swings with no logical code change.
  • Tools and techniques mentioned: randomizing layout (e.g. Stabilizer, linker padding options), causal profilers (Coz), and running across multiple CPUs/compilers with error bars.
  • There’s criticism of ad hoc benchmarks on overloaded laptops; the CPython team points to a dedicated benchmarking suite and explains why they avoid constantly changing compiler versions there.
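
As a small illustration of reporting a spread instead of a single number, the sketch below times a function several times and summarizes the samples as min / median / max. It is an assumption-laden example, not the CPython team's benchmarking tooling: the names Summary, summarize, time_once, and benchmark are hypothetical, and it uses standard clock(), which measures CPU time at coarse resolution. It does nothing about layout or alignment effects, which need tools like the ones mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

typedef struct Summary { double min, median, max; } Summary;

static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Sorts samples in place and reports min / median / max. Requires n > 0. */
Summary summarize(double *samples, size_t n) {
    Summary s;
    assert(n > 0);
    qsort(samples, n, sizeof samples[0], cmp_double);
    s.min = samples[0];
    s.max = samples[n - 1];
    s.median = (n % 2) ? samples[n / 2]
                       : 0.5 * (samples[n / 2 - 1] + samples[n / 2]);
    return s;
}

/* CPU seconds consumed by one call of fn, measured with clock(). */
double time_once(void (*fn)(void)) {
    clock_t start = clock();
    fn();
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Times fn `runs` times; `out` must have room for `runs` samples. */
Summary benchmark(void (*fn)(void), double *out, size_t runs) {
    for (size_t i = 0; i < runs; i++) {
        out[i] = time_once(fn);
    }
    return summarize(out, runs);
}
```

Calling benchmark(workload, buffer, 30) with a hypothetical workload function yields a Summary. If the min-to-max range of two builds overlaps, a few percent difference in their medians is weak evidence on its own.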

C, compilers, and “portable assembly” debate

  • A long subthread argues over whether C is “portable assembly” or “close to the metal.”
  • Examples show compilers:
    • Eliminating or moving a += 1 when results are provably unused or constant-foldable.
    • Exploiting undefined behavior (e.g. signed overflow) to delete checks or branches.
    • Autovectorizing and radically reshaping loops.
  • Some see C as still much more transparent than C++; others argue that modern optimizers and UB make reasoning about the emitted machine code increasingly unreliable, which is exactly the problem for tight interpreter loops.
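
The kinds of transformation listed above can be made concrete with a short C sketch. The function names are hypothetical and the comments describe what an optimizer is permitted to do, not what any particular compiler version will emit.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Relies on wraparound that C leaves undefined for signed int. Because
 * overflow in x + 1 is undefined behavior, an optimizer may assume it
 * never happens and fold this function to `return false`. */
bool next_wraps_naive(int x) {
    return x + 1 < x;
}

/* Well-defined alternative: test against the limit before any arithmetic. */
bool next_wraps_checked(int x) {
    return x == INT_MAX;
}

/* `scratch` is never read, so the compiler is free to delete its updates.
 * It may also vectorize the loop or replace it with a closed form. */
int sum_to(int n) {
    int scratch = 0;
    int total = 0;
    assert(n >= 0 && n <= 1000); /* keeps total far from overflow */
    for (int i = 1; i <= n; i++) {
        scratch += 1;
        total += i;
    }
    return total;
}
```

In each case the source reads like a direct description of machine steps, while the emitted code is only required to produce the same observable results for programs without undefined behavior.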

Python version performance and project trajectory

  • Several users report Python 3.12/3.13 being slower than 3.11 in loops and web workloads; a specific loop regression issue is referenced.
  • The “faster CPython” effort is said to have already delivered ~1.6× over 3.10 (more with JIT), with an eventual 5× goal; progress is incremental and compounded over releases.