Python 3.15’s interpreter for Windows x86-64 should hopefully be 15% faster

Technical change: tail-call interpreter vs computed goto

  • The new CPython Windows x86-64 interpreter dispatches bytecode through tail calls between per-instruction functions, instead of one giant switch/case or computed-goto loop.
  • The current eval loop is ~12k lines in a single function; this breaks many compiler heuristics, especially inlining, leading compilers to refuse to inline even trivial helpers.
  • Tail calls split the interpreter into smaller functions and “reset” optimizer heuristics at each step; this appears to account for most of the speedup, more so than better register reuse.
  • There’s discussion that this structure is also friendlier to CPU branch predictors than a single large dispatch loop.
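
To make the structural difference concrete, here is a minimal sketch of both dispatch styles for a toy stack machine. Everything in it is an assumption for illustration: the opcodes, handler names, and program are hypothetical and unrelated to CPython's real bytecode. Python has no guaranteed tail calls, so a small trampoline loop stands in for what would be a compiler-enforced tail call in C. The sketch shows the shape of the code only, not the performance effect.

```python
# Toy stack VM with hypothetical opcodes (not CPython's real bytecode).
PUSH, ADD, MUL, HALT = range(4)


def run_switch(code):
    """Classic style: one big loop with one branch per opcode."""
    stack, pc = [], 0
    while True:
        op = code[pc]
        if op == PUSH:
            stack.append(code[pc + 1])
            pc += 2
        elif op == ADD:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
            pc += 1
        elif op == MUL:
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
            pc += 1
        elif op == HALT:
            return stack.pop()
        else:
            raise ValueError(f"unknown opcode {op}")


# Tail-call style: one small function per opcode. Each handler ends by
# naming the handler for the next instruction. In C that final step would
# be a guaranteed tail call; here the trampoline in run_handlers makes it.
def op_push(code, pc, stack):
    stack.append(code[pc + 1])
    return HANDLERS[code[pc + 2]], pc + 2


def op_add(code, pc, stack):
    b, a = stack.pop(), stack.pop()
    stack.append(a + b)
    return HANDLERS[code[pc + 1]], pc + 1


def op_mul(code, pc, stack):
    b, a = stack.pop(), stack.pop()
    stack.append(a * b)
    return HANDLERS[code[pc + 1]], pc + 1


def op_halt(code, pc, stack):
    return None, pc  # no next handler: stop the trampoline


HANDLERS = {PUSH: op_push, ADD: op_add, MUL: op_mul, HALT: op_halt}


def run_handlers(code):
    """Trampoline standing in for C tail calls between handlers."""
    stack, pc = [], 0
    handler = HANDLERS[code[pc]]
    while handler is not None:
        handler, pc = handler(code, pc, stack)
    return stack.pop()


PROGRAM = [PUSH, 2, PUSH, 3, ADD, PUSH, 4, MUL, HALT]  # (2 + 3) * 4
```

Both run_switch(PROGRAM) and run_handlers(PROGRAM) evaluate the same program. The point of the second form is that each handler is a small, separate function that a compiler can analyze on its own.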

MSVC specifics and musttail

  • The speedup on Windows hinges on two MSVC features: the [[msvc::musttail]] attribute, which enforces tail calls, and the __preserve_none calling convention, which controls how registers are preserved across calls.
  • There’s some concern about relying on relatively new / experimental compiler features, but CPython keeps three interpreters (switch, computed goto, tail-call) and can fall back to the classic one if MSVC behavior regresses.
  • Dispatch is autogenerated and selectable via build flags, so maintenance costs are said to be low aside from a few hundred lines of MSVC-specific glue.
  • A side thread notes syntax quirks of __preserve_none vs GCC attributes and that musttail is documented, contrary to the blog’s initial implication.

Performance, JITs, and expectations

  • Some commenters see ~15% as “low-hanging fruit” that should have been done long ago; others argue this level of attention and rapid use of fresh MSVC features shows the core loop is already heavily optimized.
  • Debate over whether micro-optimizing an interpreter is worth it versus adding a JIT; multiple replies say naïvely JIT-compiling Python bytecode gives limited gains because most cost lies in dynamic dispatch and object semantics.
  • Broader context: CPython 3.11–3.14 are reported to be significantly faster than 3.9–3.10, though still much slower than PyPy or JavaScript engines.
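
The point about dynamic dispatch can be illustrated with a small sketch using the standard dis module. The Meters class and the add function are hypothetical names invented for this example. The function compiles to a single generic binary-operation instruction, yet what that instruction does is decided at run time from the operand types. Translating the bytecode to machine code one-for-one would not remove that work.

```python
import dis


class Meters:
    """Hypothetical user-defined type that overloads +."""

    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        return Meters(self.value + other.value)


def add(a, b):
    return a + b


# The compiled function contains exactly one generic binary-op instruction
# (its exact name differs between CPython versions).
binary_ops = [
    ins.opname
    for ins in dis.get_instructions(add)
    if ins.opname.startswith("BINARY_")
]

# The same instruction performs integer addition, string concatenation,
# list concatenation, or a call into user code, depending on the operands.
results = [
    add(1, 2),
    add("py", "thon"),
    add([1], [2]),
    add(Meters(1), Meters(2)).value,
]
```

Printing binary_ops and results shows one instruction name next to four differently typed results.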

Language semantics and ecosystem constraints

  • Several comments contrast Python with JavaScript: Python’s extreme dynamism and stable C extension ABI make deep optimization and JIT compilation harder to do without breaking existing C extensions or semantics.
  • Faster alternative runtimes like PyPy exist but trade off C-API compatibility and are less used where NumPy and other C-heavy libraries dominate.

Other tangents

  • Complaints that Python’s real usability pain is packaging and startup/import time; lazy imports (PEP 810) are mentioned as a future improvement.
  • Interest in Python GUI tooling on Windows (wxPython, Qt, ImGui) and appreciation that a faster Windows interpreter directly benefits such use cases.
  • Some meta-discussion about benchmarking (violin plots, histogram tradeoffs) and praise for the author’s transparency after an earlier LLVM-related benchmarking misinterpretation.