Something weird is happening with LLMs and chess

What’s surprising about LLMs and chess

  • Most models in the experiment play very weak chess, even against the lowest Stockfish level.
  • One exception: gpt-3.5-turbo-instruct plays surprisingly well (roughly mid-amateur strength in multiple users’ experience), far better than both newer or larger models from the same vendor and most open models.
  • Other experiments linked in the thread independently find the same outlier: 3.5‑turbo‑instruct has unusually strong, mostly‑legal play; other GPT models often blunder or propose illegal moves.

Hypotheses for why 3.5-turbo-instruct is good

  • Extra chess data / fine‑tuning:
    • Likely trained on many PGNs, possibly with targeted fine‑tuning or RLHF on strong moves and puzzles.
    • Fits with vendor documentation mentioning chess PGNs (≥1800 Elo games) in pretraining for at least one model family.
  • Prompt / interface effects:
    • 3.5‑turbo‑instruct is a text-completion model; when prompted to continue a PGN transcript exactly, its play improves sharply.
    • Chat-format models fare much worse, and performance is extremely sensitive to notation: spacing, move formatting, and PGN versus English descriptions of moves.
  • No external engine:
    • Evidence cited against a hidden chess engine: sensitivity to notation/history, illegal moves still occurring, logprob inspection, and behavior unlike classical engines.
  • Hidden engine / “cheating” theory:
    • Some argue a closed model might call a simple ~1800 Elo engine via tools, citing the vendor’s incentives and past overhyped demos.
    • Counterpoints stress complexity, low payoff, and employee claims there is no such special casing.
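The PGN-continuation trick above can be made concrete. A minimal sketch, assuming a completion-style interface: the prompt is a game transcript that ends exactly where the next move goes, so the model's most natural completion is a move. The header lines and exact formatting here are illustrative assumptions, not the experiment's actual prompt.

```python
# Sketch of a PGN-continuation prompt for a completion-style model.
# The headers and formatting details are illustrative assumptions; the
# key idea is that the prompt ends exactly where the next move goes.

def pgn_prompt(moves):
    """Format a list of SAN moves (e.g. ["e4", "e5", "Nf3"]) as a
    PGN movetext prompt whose natural completion is the next move."""
    headers = '[Event "Casual game"]\n[Result "*"]\n\n'
    parts = []
    for i, mv in enumerate(moves):
        if i % 2 == 0:                    # White to move: prefix the move number
            parts.append(f"{i // 2 + 1}.")
        parts.append(mv)
    if len(moves) % 2 == 0:               # Black just moved (or game start):
        parts.append(f"{len(moves) // 2 + 1}.")   # cue the next White move
    return headers + " ".join(parts)

print(pgn_prompt(["e4", "e5", "Nf3", "Nc6"]))
# last line of output: 1. e4 e5 2. Nf3 Nc6 3.
```

A chat-format model instead receives this transcript wrapped in a conversational template, which, per the thread, is exactly the condition under which play degrades.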

Evaluation and reproducibility concerns

  • Open models were quantized (Q5_K_M), which likely degrades their play relative to the (presumably full-precision) closed models.
  • Temperature, sampling settings, grammar constraints, and up-to-10× resampling were all part of the harness; these choices may heavily affect measured strength.
  • Different Stockfish versions and difficulty presets lead to conflicting replication attempts; some cannot reproduce “beats Stockfish” at all.
  • Commenters note small trial counts and limited hyperparameter sweeps, and the author later hints at having found a mundane explanation, not yet disclosed.
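A retry harness like the one described might look as follows. This is a sketch under stated assumptions: `sample_fn` stands in for a model call, `legal_moves` would come from a chess library in a real harness, and the random-legal-move fallback is one plausible policy, not necessarily the author's.

```python
import random

def choose_move(sample_fn, legal_moves, max_retries=10):
    """Sample from the model until a legal move appears, up to
    max_retries attempts; otherwise fall back to a random legal move.

    sample_fn stands in for a model call; legal_moves would come from
    a chess library in a real harness. Returns (move, was_sampled).
    """
    for _ in range(max_retries):
        move = sample_fn()
        if move in legal_moves:
            return move, True
    # Fallback policy (an assumption, not necessarily the author's):
    # play a uniformly random legal move and record the failure.
    return random.choice(sorted(legal_moves)), False

samples = iter(["Ke9", "Nf3"])            # first sample is illegal
print(choose_move(lambda: next(samples), {"e4", "Nf3", "d4"}))
# → ('Nf3', True)
```

Note how much this knob matters for the comparisons above: a model that emits illegal moves 30% of the time looks far stronger with 10 retries than with 1.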

Broader debates: tokenization and reasoning

  • Long subthread on whether LLM failures at counting, letter-count tasks, and chess are mainly:
    • Tokenization artifacts (e.g., “strawberry” splitting into multiple tokens, integers not tokenized digit-by-digit), or
    • Fundamental transformer limitations on sequential/algorithmic reasoning, mitigated by chain-of-thought prompting.
  • Others discuss alternative tokenization (character/byte-level, multi-tokenization schemes), but note severe cost and context tradeoffs and limited empirical gains so far.
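The tokenization argument can be made concrete. At the character level, letter counting is trivial; a model that sees only opaque token IDs must have memorized each token's spelling to answer the same question. The token split below is illustrative, not an actual BPE vocabulary entry.

```python
def count_letter(word, letter):
    """Character-level counting: trivial once the input is seen as characters."""
    return sum(ch == letter for ch in word)

# An illustrative (not actual) BPE-style split of "strawberry":
tokens = ["str", "aw", "berry"]
assert "".join(tokens) == "strawberry"

# A model that sees three opaque token IDs must recall each token's
# spelling to answer "how many r's?"; character-wise it is immediate.
print(count_letter("strawberry", "r"))    # → 3
```

The same framing applies to chess: whether "Nf3" arrives as one token or three changes what the model must learn to associate moves with board states.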

Bigger-picture reflections

  • Some see chess weakness as expected: LLMs are sequence predictors, not planners or search algorithms.
  • Others note that strong 3.5‑instruct play and specialized transformer-based chess engines show transformers can encode nontrivial world models for formal games.
  • Several warn that benchmarks like chess will likely be directly optimized for in future models, blurring the line between “emergent ability” and targeted training.