Something weird is happening with LLMs and chess

What’s surprising about LLMs and chess

  • Most models in the experiment play very weak chess, even against the lowest Stockfish level.
  • One exception: gpt-3.5-turbo-instruct plays surprisingly well (roughly mid-amateur strength in multiple users’ experience), far better than both newer or larger models from the same vendor and most open models.
  • Other experiments linked in the thread independently find the same outlier: 3.5‑turbo‑instruct has unusually strong, mostly‑legal play; other GPT models often blunder or propose illegal moves.

Hypotheses for why 3.5-turbo-instruct is good

  • Extra chess data / fine‑tuning:
    • Likely trained on many PGNs, possibly with targeted fine‑tuning or RLHF on strong moves and puzzles.
    • Fits with vendor documentation mentioning chess PGNs (≥1800 Elo games) in pretraining for at least one model family.
  • Prompt / interface effects:
    • 3.5‑turbo‑instruct is a text-completion model; when prompted to continue a PGN transcript exactly, its play improves sharply.
    • Chat-format models fare much worse, and performance is extremely sensitive to notation: spacing, move formatting, and PGN versus English descriptions of moves.
  • No external engine:
    • Evidence cited against a hidden chess engine: sensitivity to notation/history, illegal moves still occurring, logprob inspection, and behavior unlike classical engines.
  • Hidden engine / “cheating” theory:
    • Some argue a closed model might call a simple ~1800 Elo engine via tools, citing the vendor’s incentives and past overhyped demos.
    • Counterpoints stress complexity, low payoff, and employee claims there is no such special casing.
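The PGN-continuation trick above can be made concrete. A minimal sketch, assuming a completion-style interface: the prompt is a game transcript that ends exactly where the next move goes, so the model's most natural completion is a move. The header lines and exact formatting here are illustrative assumptions, not the experiment's actual prompt.

```python
# Sketch of a PGN-continuation prompt for a completion-style model.
# The headers and formatting details are illustrative assumptions; the
# key idea is that the prompt ends exactly where the next move goes.

def pgn_prompt(moves):
    """Format a list of SAN moves (e.g. ["e4", "e5", "Nf3"]) as a
    PGN movetext prompt whose natural completion is the next move."""
    headers = '[Event "Casual game"]\n[Result "*"]\n\n'
    parts = []
    for i, mv in enumerate(moves):
        if i % 2 == 0:                    # White to move: prefix the move number
            parts.append(f"{i // 2 + 1}.")
        parts.append(mv)
    if len(moves) % 2 == 0:               # Black just moved (or game start):
        parts.append(f"{len(moves) // 2 + 1}.")   # cue the next White move
    return headers + " ".join(parts)

print(pgn_prompt(["e4", "e5", "Nf3", "Nc6"]))
# last line of output: 1. e4 e5 2. Nf3 Nc6 3.
```

A chat-format model instead receives this transcript wrapped in a conversational template, which, per the thread, is exactly the condition under which play degrades.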

Evaluation and reproducibility concerns

  • Open models were quantized (Q5_K_M), which likely degrades their play relative to the (presumably full-precision) closed models.
  • Temperature, sampling settings, grammar constraints, and up-to-10× resampling were all part of the harness; these choices may heavily affect measured strength.
  • Different Stockfish versions and difficulty presets lead to conflicting replication attempts; some cannot reproduce “beats Stockfish” at all.
  • Commenters note small trial counts and limited hyperparameter sweeps, and the author later hints at having found a mundane explanation, not yet disclosed.
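A retry harness like the one described might look as follows. This is a sketch under stated assumptions: `sample_fn` stands in for a model call, `legal_moves` would come from a chess library in a real harness, and the random-legal-move fallback is one plausible policy, not necessarily the author's.

```python
import random

def choose_move(sample_fn, legal_moves, max_retries=10):
    """Sample from the model until a legal move appears, up to
    max_retries attempts; otherwise fall back to a random legal move.

    sample_fn stands in for a model call; legal_moves would come from
    a chess library in a real harness. Returns (move, was_sampled).
    """
    for _ in range(max_retries):
        move = sample_fn()
        if move in legal_moves:
            return move, True
    # Fallback policy (an assumption, not necessarily the author's):
    # play a uniformly random legal move and record the failure.
    return random.choice(sorted(legal_moves)), False

samples = iter(["Ke9", "Nf3"])            # first sample is illegal
print(choose_move(lambda: next(samples), {"e4", "Nf3", "d4"}))
# → ('Nf3', True)
```

Note how much this knob matters for the comparisons above: a model that emits illegal moves 30% of the time looks far stronger with 10 retries than with 1.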

Broader debates: tokenization and reasoning

  • Long subthread on whether LLM failures at counting, letter-count tasks, and chess are mainly:
    • Tokenization artifacts (e.g., “strawberry” splitting into multiple tokens, integers not tokenized digit-by-digit), or
    • Fundamental transformer limitations on sequential/algorithmic reasoning, mitigated by chain-of-thought prompting.
  • Others discuss alternative tokenization (character/byte-level, multi-tokenization schemes), but note severe cost and context tradeoffs and limited empirical gains so far.
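The tokenization argument can be made concrete. At the character level, letter counting is trivial; a model that sees only opaque token IDs must have memorized each token's spelling to answer the same question. The token split below is illustrative, not an actual BPE vocabulary entry.

```python
def count_letter(word, letter):
    """Character-level counting: trivial once the input is seen as characters."""
    return sum(ch == letter for ch in word)

# An illustrative (not actual) BPE-style split of "strawberry":
tokens = ["str", "aw", "berry"]
assert "".join(tokens) == "strawberry"

# A model that sees three opaque token IDs must recall each token's
# spelling to answer "how many r's?"; character-wise it is immediate.
print(count_letter("strawberry", "r"))    # → 3
```

The same framing applies to chess: whether "Nf3" arrives as one token or three changes what the model must learn to associate moves with board states.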

Bigger-picture reflections

  • Some see chess weakness as expected: LLMs are sequence predictors, not planners or search algorithms.
  • Others note that strong 3.5‑instruct play and specialized transformer-based chess engines show transformers can encode nontrivial world models for formal games.
  • Several warn that benchmarks like chess will likely be directly optimized for in future models, blurring the line between “emergent ability” and targeted training.