2025-12-25

Python 3.15’s interpreter for Windows x86-64 should hopefully be 15% faster

Technical change: tail-call interpreter vs computed goto

The new CPython Windows x86-64 interpreter uses a tail-calling dispatch loop instead of a giant switch/case or computed-goto loop.
The existing eval loop is roughly 12k lines in a single function. A function that large defeats many compiler heuristics, inlining in particular: compilers end up refusing to inline even trivial helpers.
Tail calls split the interpreter into smaller functions, which effectively “resets” the optimizer's heuristics for each one. This appears to account for most of the speedup, more than better register reuse does.
Some commenters suggest this structure is also friendlier to CPU branch predictors than a single large dispatch loop.
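To make the contrast concrete, here is a minimal C sketch of the two dispatch styles for a toy stack machine. Everything in it (the opcodes, ToyVM, run_switch, run_tail, demo_program) is hypothetical and far simpler than CPython's generated code. The tail-call version uses plain return calls; a real build would additionally need a compiler guarantee such as musttail so the C stack does not grow with every instruction.

```c
#include <assert.h>

/* Hypothetical toy bytecode; none of these names come from CPython. */
enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT, OP_COUNT };

typedef struct {
    long stack[16];
    int sp; /* number of values currently on the stack */
} ToyVM;

/* Style 1: every opcode lives inside one function with a switch in a loop. */
long run_switch(const int *code) {
    ToyVM vm = { {0}, 0 };
    for (;;) {
        switch (*code++) {
        case OP_PUSH:
            assert(vm.sp < 16);
            vm.stack[vm.sp++] = *code++;
            break;
        case OP_ADD:
            assert(vm.sp >= 2);
            vm.sp--;
            vm.stack[vm.sp - 1] += vm.stack[vm.sp];
            break;
        case OP_MUL:
            assert(vm.sp >= 2);
            vm.sp--;
            vm.stack[vm.sp - 1] *= vm.stack[vm.sp];
            break;
        case OP_HALT:
            assert(vm.sp >= 1);
            return vm.stack[vm.sp - 1];
        default:
            assert(0 && "unknown opcode");
            return 0;
        }
    }
}

/* Style 2: one small function per opcode. Each handler finishes by
 * calling the next handler in tail position instead of looping. */
typedef long (*Handler)(ToyVM *vm, const int *code);

static long op_push(ToyVM *vm, const int *code);
static long op_add(ToyVM *vm, const int *code);
static long op_mul(ToyVM *vm, const int *code);
static long op_halt(ToyVM *vm, const int *code);

static const Handler handlers[OP_COUNT] = {
    [OP_PUSH] = op_push, [OP_ADD] = op_add,
    [OP_MUL] = op_mul,   [OP_HALT] = op_halt,
};

/* Read the opcode at `code` and jump to its handler. */
static long dispatch(ToyVM *vm, const int *code) {
    assert(*code >= 0 && *code < OP_COUNT);
    return handlers[*code](vm, code + 1);
}

static long op_push(ToyVM *vm, const int *code) {
    assert(vm->sp < 16);
    vm->stack[vm->sp++] = *code++; /* operand follows the opcode */
    return dispatch(vm, code);
}

static long op_add(ToyVM *vm, const int *code) {
    assert(vm->sp >= 2);
    vm->sp--;
    vm->stack[vm->sp - 1] += vm->stack[vm->sp];
    return dispatch(vm, code);
}

static long op_mul(ToyVM *vm, const int *code) {
    assert(vm->sp >= 2);
    vm->sp--;
    vm->stack[vm->sp - 1] *= vm->stack[vm->sp];
    return dispatch(vm, code);
}

static long op_halt(ToyVM *vm, const int *code) {
    (void)code; /* nothing left to decode */
    assert(vm->sp >= 1);
    return vm->stack[vm->sp - 1];
}

long run_tail(const int *code) {
    ToyVM vm = { {0}, 0 };
    return dispatch(&vm, code);
}

/* Computes (2 + 3) * 4. */
const int demo_program[] = {
    OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PUSH, 4, OP_MUL, OP_HALT
};
```

Calling run_switch(demo_program) and run_tail(demo_program) should give the same result. The point of the second style is that each handler is a small, separate unit for the optimizer, which is the property the discussion credits for most of the gain.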

MSVC specifics and musttail

The speedup on Windows hinges on two MSVC features: [[msvc::musttail]] to enforce tail calls and __preserve_none to control the calling convention.
There’s some concern about relying on relatively new / experimental compiler features, but CPython keeps three interpreters (switch, computed goto, tail-call) and can fall back to the classic one if MSVC behavior regresses.
Dispatch is autogenerated and selectable via build flags, so maintenance costs are said to be low aside from a few hundred lines of MSVC-specific glue.
A side thread notes syntax quirks of __preserve_none vs GCC attributes and that musttail is documented, contrary to the blog’s initial implication.
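The compiler-specific glue described above could look roughly like the following C sketch. The macro and function names (TOY_MUSTTAIL, TOY_PRESERVE_NONE, sum_down, toy_sum) are hypothetical. Only the GCC/Clang-style attribute spellings are used, guarded by __has_attribute; an MSVC build would plug in the [[msvc::musttail]] and __preserve_none spellings named in the thread, whose exact placement is left out here because the summary does not show it.

```c
#include <assert.h>

/* Use the attributes only when the compiler reports support for them.
 * Otherwise both macros expand to nothing and the code is plain C. */
#if defined(__has_attribute)
#  if __has_attribute(musttail)
#    define TOY_MUSTTAIL __attribute__((musttail))
#  endif
#  if __has_attribute(preserve_none)
#    define TOY_PRESERVE_NONE __attribute__((preserve_none))
#  endif
#endif
#ifndef TOY_MUSTTAIL
#  define TOY_MUSTTAIL /* fallback: ordinary call, no tail-call guarantee */
#endif
#ifndef TOY_PRESERVE_NONE
#  define TOY_PRESERVE_NONE /* fallback: default calling convention */
#endif

/* Adds n + (n - 1) + ... + 1 to acc. The recursive call is in tail
 * position; with musttail the compiler must turn it into a jump. */
TOY_PRESERVE_NONE static long sum_down(long n, long acc) {
    if (n == 0) {
        return acc;
    }
    TOY_MUSTTAIL return sum_down(n - 1, acc + n);
}

/* Ordinary entry point that calls into the tail-calling function. */
long toy_sum(long n) {
    assert(n >= 0);
    return sum_down(n, 0);
}
```

For example, toy_sum(10) adds the integers from 1 to 10. Keeping the attributes behind macros is what makes a fallback cheap: if a compiler drops or changes a feature, only the macro definitions need to change, not the handlers.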

Performance, JITs, and expectations

Some commenters see ~15% as “low-hanging fruit” that should have been done long ago; others argue this level of attention and rapid use of fresh MSVC features shows the core loop is already heavily optimized.
There is debate over whether micro-optimizing an interpreter is worth it compared with adding a JIT; multiple replies say naïvely JIT-compiling Python bytecode gives limited gains because most of the cost lies in dynamic dispatch and object semantics.
Broader context: CPython 3.11–3.14 are reported to be significantly faster than 3.9–3.10, though still much slower than PyPy or JavaScript engines.
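The argument that the cost sits in dynamic dispatch and object semantics rather than in the bytecode loop can be illustrated with a small C sketch. The ToyValue type and the add_static and add_dynamic functions are hypothetical and much simpler than real Python objects; the sketch only shows that a dynamically typed add has to inspect its operands at run time even when no interpreter loop surrounds it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical tagged value; real Python objects are considerably richer. */
typedef enum { TOY_NONE, TOY_INT, TOY_FLOAT } ToyTag;

typedef struct {
    ToyTag tag;
    union { long i; double f; } as;
} ToyValue;

ToyValue toy_none(void) { ToyValue v; v.tag = TOY_NONE; v.as.i = 0; return v; }
ToyValue toy_int(long i) { ToyValue v; v.tag = TOY_INT; v.as.i = i; return v; }
ToyValue toy_float(double f) { ToyValue v; v.tag = TOY_FLOAT; v.as.f = f; return v; }

/* Statically typed add: the operand types are known at compile time. */
long add_static(long a, long b) {
    return a + b;
}

/* Dynamically typed add: the types are only known at run time, so the
 * checks below remain even if the surrounding dispatch loop is removed.
 * Returns false when the operand types are unsupported. */
bool add_dynamic(ToyValue a, ToyValue b, ToyValue *out) {
    if (a.tag == TOY_INT && b.tag == TOY_INT) {
        *out = toy_int(a.as.i + b.as.i);
        return true;
    }
    bool a_num = (a.tag == TOY_INT || a.tag == TOY_FLOAT);
    bool b_num = (b.tag == TOY_INT || b.tag == TOY_FLOAT);
    if (a_num && b_num) {
        /* Mixed int/float: promote both sides to double. */
        double x = (a.tag == TOY_INT) ? (double)a.as.i : a.as.f;
        double y = (b.tag == TOY_INT) ? (double)b.as.i : b.as.f;
        *out = toy_float(x + y);
        return true;
    }
    return false;
}
```

A naïve JIT that only removes opcode decoding would still emit something like add_dynamic for every addition; getting closer to add_static requires knowing or speculating on the operand types.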

Language semantics and ecosystem constraints

Several comments contrast Python with JavaScript: Python’s extreme dynamism and its stable C extension ABI make deep optimization and JIT compilation hard to do without breaking existing C extensions or language semantics.
Faster alternative runtimes like PyPy exist but trade off C-API compatibility and are less used where NumPy and other C-heavy libraries dominate.

Other tangents

Complaints that Python’s real usability pain is packaging and startup/import time; lazy imports (PEP 810) are mentioned as a future improvement.
Interest in Python GUI tooling on Windows (wxPython, Qt, ImGui) and appreciation that a faster Windows interpreter directly benefits such use cases.
Some meta-discussion about benchmarking (violin plots, histogram tradeoffs) and praise for the author’s transparency after an earlier LLVM-related benchmarking misinterpretation.
