Something weird is happening with LLMs and chess
What’s surprising about chess and LLMs here
- Most models in the experiment play very weak chess, even against the lowest Stockfish level.
- One exception: gpt-3.5-turbo-instruct plays surprisingly well (roughly mid-amateur level in multiple users’ experience), and much better than newer or larger models from the same vendor and most open models.
- Other experiments linked in the thread independently find the same outlier: 3.5-turbo-instruct shows unusually strong, mostly-legal play, while other GPT models often blunder or propose illegal moves.
Hypotheses for why 3.5-turbo-instruct is good
- Extra chess data / fine‑tuning:
- Likely trained on many PGNs, possibly with targeted fine‑tuning or RLHF on strong moves and puzzles.
- Fits with vendor documentation mentioning chess PGNs (≥1800 Elo games) in pretraining for at least one model family.
- Prompt / interface effects:
- 3.5-turbo-instruct works as a text completion model; when prompted to continue a PGN exactly, performance jumps (see the prompt sketch after this list).
- Chat-format models fare much worse; performance is extremely sensitive to notation, whitespace, and PGN vs. plain-English move descriptions.
- No external engine:
- Evidence cited against a hidden chess engine: sensitivity to notation and move history, illegal moves still occurring, logprob inspection (the move distribution looks like a language model’s, not an engine’s), and playing behavior unlike classical engines.
- Hidden engine / “cheating” theory:
- Some argue a closed model might call a simple ~1800 Elo engine via tools, citing the vendor’s incentives and past overhyped demos.
- Counterpoints stress complexity, low payoff, and employee claims there is no such special casing.
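The prompt/interface and logprob points above can be made concrete. Below is a minimal sketch, assuming the openai Python package (v1+), an API key in the environment, and access to gpt-3.5-turbo-instruct; the prompt headers and moves are illustrative, not the original experiment’s exact harness. It shows the completion-style PGN prompting that reportedly unlocks strong play, plus the kind of logprob inspection commenters used to argue against a hidden engine.

```python
# Minimal sketch: ask gpt-3.5-turbo-instruct to *continue a PGN* rather
# than answer a chat question. Assumes openai>=1.0 and OPENAI_API_KEY set;
# headers and moves are illustrative.
from openai import OpenAI

client = OpenAI()

# Completion-style prompt: PGN headers plus movetext so far, ending right
# where the next move should appear. Small formatting changes here
# (whitespace, move-number style) reportedly shift play quality a lot.
pgn_prompt = (
    '[White "Garry Kasparov"]\n'
    '[Black "Stockfish"]\n'
    '[Result "*"]\n'
    "\n"
    "1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4 Nf6 5. O-O "
)

resp = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt=pgn_prompt,
    max_tokens=6,     # room for one SAN move such as "Be7" or "Nxe4"
    temperature=0.0,  # greedy decoding: take the most likely continuation
    logprobs=5,       # return top-5 alternatives per token for inspection
)

choice = resp.choices[0]
print("continuation:", choice.text.strip())

# Logprob inspection: the per-token distribution looks like a language
# model's (smooth, sensitive to prompt formatting), not like the output of
# a wrapped engine, which is one argument against the "cheating" theory.
for top in choice.logprobs.top_logprobs:
    print(top)
```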
Evaluation and reproducibility concerns
- Open models were quantized (Q5_K_M), which likely degrades play relative to full-precision closed models.
- Temperature, sampling settings, grammar constraints, and resampling illegal moves up to 10 times all factored into the setup; these choices may heavily affect measured strength (see the harness sketch after this list).
- Different Stockfish versions and difficulty presets lead to conflicting replication attempts; some replicators cannot reproduce “beats Stockfish” at all.
- Commenters note small trial counts, limited hyperparameter sweeps, and that the author later hints they found a mundane explanation not yet disclosed.
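For the methodology concerns above, here is a hedged sketch of a replication harness, assuming python-chess and a local `stockfish` binary on PATH. The `sample_model_move` placeholder, the retry count, the skill-level preset, and the random-legal-move fallback are all assumptions standing in for choices that genuinely differ across replication attempts.

```python
# Hypothetical replication harness: LLM (White) vs. Stockfish at a low
# difficulty preset, with "resample up to 10 times" on illegal moves.
import random
import chess
import chess.engine

def sample_model_move(board: chess.Board) -> str:
    """Placeholder for the LLM call under test. Returning a random legal
    move in SAN keeps the harness runnable end to end."""
    return board.san(random.choice(list(board.legal_moves)))

def llm_move(board: chess.Board, retries: int = 10) -> chess.Move:
    # Resample until the model emits a legal SAN move; fall back to a
    # random legal move. Both knobs materially affect measured strength.
    for _ in range(retries):
        try:
            return board.parse_san(sample_model_move(board).strip())
        except ValueError:  # illegal, invalid, or ambiguous SAN
            continue
    return random.choice(list(board.legal_moves))

board = chess.Board()
with chess.engine.SimpleEngine.popen_uci("stockfish") as engine:
    # "Skill Level" 0 is one common low-difficulty preset, but its meaning
    # varies across Stockfish versions, which is one source of the
    # conflicting replications noted above.
    engine.configure({"Skill Level": 0})
    while not board.is_game_over():
        if board.turn == chess.WHITE:
            board.push(llm_move(board))
        else:
            board.push(engine.play(board, chess.engine.Limit(time=0.1)).move)

print(board.result())
```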
Broader debates: tokenization and reasoning
- Long subthread on whether LLM failures at counting, letter-count tasks, and chess are mainly:
- Tokenization artifacts (e.g., “strawberry” splitting into multi-character tokens, integers not tokenized digit by digit; see the tiktoken sketch after this list), or
- Fundamental transformer limitations on sequential/algorithmic reasoning, mitigated by chain-of-thought prompting.
- Others discuss alternative tokenization (character/byte-level, multi-tokenization schemes), but note severe cost and context tradeoffs and limited empirical gains so far.
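The tokenization point is easy to verify directly. A short sketch using tiktoken; cl100k_base is the encoding used by GPT-3.5/4-era models, and the exact splits vary by encoding, so treat the output as illustrative.

```python
# Show how words and numbers split into tokens. Requires `pip install
# tiktoken`; cl100k_base is the GPT-3.5/4-era encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["strawberry", " strawberry", "123456789", "1. e4 e5 2. Nf3"]:
    pieces = [enc.decode([tok]) for tok in enc.encode(text)]
    print(f"{text!r:20} -> {pieces}")

# The model never sees individual letters of "strawberry" or individual
# digits of a long integer, which is the tokenization-artifact side of the
# counting/letter-count debate; chess PGN tokenization is likewise
# sensitive to spaces and move numbers.
```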
Bigger-picture reflections
- Some see chess weakness as expected: LLMs are sequence models, not planners or search engines.
- Others note that strong 3.5‑instruct play and specialized transformer-based chess engines show transformers can encode nontrivial world models for formal games.
- Several warn that benchmarks like chess will likely be directly optimized for in future models, blurring the line between “emergent ability” and targeted training.