Python 3.15’s interpreter for Windows x86-64 should hopefully be 15% faster
Technical change: tail-call interpreter vs computed goto
- The new CPython interpreter for Windows x86-64 dispatches bytecode through tail calls instead of one giant switch/case or computed-goto loop.
- The current eval loop is ~12k lines in a single function; this breaks many compiler heuristics, especially inlining, leading compilers to refuse to inline even trivial helpers.
- Tail calls split the interpreter into smaller functions and “reset” the optimizer’s heuristics at each step; this seems to account for most of the speedup, more than improved register reuse does.
- There’s discussion that this structure is also friendlier to CPU branch predictors than a single large dispatch loop.
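To make the contrast concrete, here is a minimal sketch of the two dispatch styles on a toy stack machine. It is not CPython's code: the opcodes, the TOY_* macro, and the toy_* functions are hypothetical names invented for illustration, and the sketch assumes well-formed bytecode. It uses the Clang/GCC spelling of musttail where the compiler reports it and otherwise falls back to a plain return, which is not guaranteed to become a tail call.

```c
#include <stddef.h>
#include <stdint.h>

/* Request a guaranteed tail call where the compiler offers one; otherwise
   fall back to a plain return (usually, but not always, optimized). */
#if defined(__has_attribute)
#  if __has_attribute(musttail)
#    define TOY_MUSTTAIL __attribute__((musttail))
#  endif
#endif
#ifndef TOY_MUSTTAIL
#  define TOY_MUSTTAIL
#endif

/* Hypothetical opcodes for the toy machine. */
enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT, OP_COUNT };

/* Style 1: one function holding every opcode in a single switch loop. */
int64_t toy_run_switch(const uint8_t *code) {
    int64_t stack[16], *sp = stack;
    const uint8_t *pc = code;
    for (;;) {
        switch (*pc) {
        case OP_PUSH: *sp++ = pc[1]; pc += 2; break;
        case OP_ADD:  sp[-2] += sp[-1]; sp--; pc++; break;
        case OP_MUL:  sp[-2] *= sp[-1]; sp--; pc++; break;
        case OP_HALT: return sp[-1];
        default:      return -1; /* unknown opcode */
        }
    }
}

/* Style 2: one small function per opcode. All handlers share a signature,
   so each one can tail-call the handler of the next instruction. */
typedef int64_t handler_fn(const uint8_t *pc, int64_t *sp);
static handler_fn op_push, op_add, op_mul, op_halt;

static handler_fn *const toy_table[OP_COUNT] = {
    [OP_PUSH] = op_push, [OP_ADD] = op_add,
    [OP_MUL]  = op_mul,  [OP_HALT] = op_halt,
};

/* Jump to the next handler; assumes *pc is a valid opcode. */
#define TOY_DISPATCH() TOY_MUSTTAIL return toy_table[*pc](pc, sp)

static int64_t op_push(const uint8_t *pc, int64_t *sp) {
    *sp++ = pc[1];              /* operand is the byte after the opcode */
    pc += 2;
    TOY_DISPATCH();
}

static int64_t op_add(const uint8_t *pc, int64_t *sp) {
    sp[-2] += sp[-1];
    sp--;
    pc++;
    TOY_DISPATCH();
}

static int64_t op_mul(const uint8_t *pc, int64_t *sp) {
    sp[-2] *= sp[-1];
    sp--;
    pc++;
    TOY_DISPATCH();
}

static int64_t op_halt(const uint8_t *pc, int64_t *sp) {
    (void)pc;
    return sp[-1];              /* result is the top of the stack */
}

int64_t toy_run_tail(const uint8_t *code) {
    int64_t stack[16];
    /* Ordinary call: the stack array must outlive the handler chain. */
    return toy_table[code[0]](code, stack);
}
```

Both entry points should return the same value for the same program; for the byte sequence OP_PUSH 2, OP_PUSH 3, OP_ADD, OP_PUSH 4, OP_MUL, OP_HALT that is (2 + 3) * 4. In the second style the compiler optimizes each handler as its own small function, which is the "reset" effect the thread describes.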
MSVC specifics and musttail
- The speedup on Windows hinges on MSVC’s [[msvc::musttail]] and __preserve_none attributes to enforce tail calls and control calling conventions.
- There’s some concern about relying on relatively new / experimental compiler features, but CPython keeps three interpreters (switch, computed goto, tail-call) and can fall back to the classic one if MSVC behavior regresses.
- Dispatch is autogenerated and selectable via build flags, so maintenance costs are said to be low aside from a few hundred lines of MSVC-specific glue.
- A side thread notes syntax quirks of __preserve_none vs GCC attributes and that musttail is documented, contrary to the blog’s initial implication.
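As a rough illustration of how such compiler-specific attributes can sit behind macros with a safe fallback, here is a sketch under stated assumptions. Every TOY_* name and the TOY_FORCE_LOOP flag are hypothetical; CPython's actual macros and build options differ. The MSVC branch simply reuses the spellings quoted in the discussion and assumes a compiler new enough to accept them, while the Clang/GCC branch is guarded by __has_attribute.

```c
#include <stdint.h>

/* All TOY_* names are hypothetical and exist only for this sketch. */
#if defined(_MSC_VER) && !defined(__clang__)
   /* Spellings quoted in the discussion; assumes a recent enough MSVC. */
#  define TOY_MUSTTAIL [[msvc::musttail]]
#  define TOY_CC __preserve_none
#  define TOY_HAVE_MUSTTAIL 1
#elif defined(__has_attribute)
#  if __has_attribute(musttail)
#    define TOY_MUSTTAIL __attribute__((musttail))
#    define TOY_HAVE_MUSTTAIL 1
#  endif
#  if __has_attribute(preserve_none)
#    define TOY_CC __attribute__((preserve_none))
#  endif
#endif

/* Fallbacks: the macros expand to nothing on compilers without support. */
#ifndef TOY_MUSTTAIL
#  define TOY_MUSTTAIL
#endif
#ifndef TOY_CC
#  define TOY_CC
#endif
#ifndef TOY_HAVE_MUSTTAIL
#  define TOY_HAVE_MUSTTAIL 0
#endif

#if TOY_HAVE_MUSTTAIL && !defined(TOY_FORCE_LOOP)

/* Tail-call variant: the calling-convention macro sits between the return
   type and the function name. */
static uint64_t TOY_CC toy_sum_step(uint64_t n, uint64_t acc) {
    if (n == 0) {
        return acc;
    }
    TOY_MUSTTAIL return toy_sum_step(n - 1, acc + n);
}

uint64_t toy_sum(uint64_t n) {
    return toy_sum_step(n, 0);
}

#else

/* Classic variant: used when guaranteed tail calls are unavailable, or when
   the build defines the hypothetical TOY_FORCE_LOOP flag. */
uint64_t toy_sum(uint64_t n) {
    uint64_t acc = 0;
    while (n != 0) {
        acc += n;
        n--;
    }
    return acc;
}

#endif
```

Callers only see toy_sum, so switching variants is a compile-time decision. That is the property the thread points to when it says CPython can fall back to the classic interpreter if a compiler misbehaves.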
Performance, JITs, and expectations
- Some commenters see ~15% as “low-hanging fruit” that should have been done long ago; others argue this level of attention and rapid use of fresh MSVC features shows the core loop is already heavily optimized.
- Debate over whether micro-optimizing an interpreter is worth it versus adding a JIT; multiple replies say naïvely JIT-compiling Python bytecode gives limited gains because most cost lies in dynamic dispatch and object semantics.
- Broader context: CPython 3.11–3.14 are reported to be significantly faster than 3.9–3.10, though still much slower than PyPy or JavaScript engines.
Language semantics and ecosystem constraints
- Several comments contrast Python’s extreme dynamism and stable C extension ABI with JavaScript’s situation: these make deep optimization and JITing harder without breaking existing C extensions or semantics.
- Faster alternative runtimes like PyPy exist but trade off C-API compatibility and are less used where NumPy and other C-heavy libraries dominate.
Other tangents
- Complaints that Python’s real usability pain is packaging and startup/import time; lazy imports (PEP 810) are mentioned as a future improvement.
- Interest in Python GUI tooling on Windows (wxPython, Qt, ImGui) and appreciation that a faster Windows interpreter directly benefits such use cases.
- Some meta-discussion about benchmarking (violin plots, histogram tradeoffs) and praise for the author’s transparency after an earlier LLVM-related benchmarking misinterpretation.