OK, I can partly explain the LLM chess weirdness now

LLM Chess Ability & Illegal Moves

  • Many note that GPT‑3.5‑turbo‑instruct plays at roughly Lichess 1750–1800 level and rarely makes illegal moves, which is non‑trivial given that it sees only a text move list, not the board state.
  • Others push back: even a single illegal move in a purely rules‑based game shows the model hasn’t fully internalized the rules, especially since human players of similar rating, playing with a visible board, almost never make one.
  • Several distinguish “chessy” illegal moves (subtle king‑in‑check issues) from blatantly impossible ones (moving non‑existent pieces or off‑board squares); skeptics say current LLMs still do the latter.

Understanding, Reasoning, and “Just Token Prediction”

  • One group argues that effective next‑token prediction over complex domains (chess, math, code, word problems) is a form of reasoning and world‑modeling, even if implemented via statistics.
  • Skeptics counter that:
    • Models fail badly on modified or unfamiliar tasks, or when irrelevant info is added.
    • They hallucinate reasoning steps post‑hoc and can regurgitate training data.
    • “Appearing to reason” is not the same as reliable, systematic reasoning.
  • Debate extends into Turing‑completeness: some say being Turing‑complete means, in principle, LLMs could implement reasoning; others call that practically irrelevant given efficiency and reliability constraints.

Training Data, Fine‑Tuning & Model Differences

  • A widely supported hypothesis: OpenAI trained some base models on many high‑quality chess games (e.g., PGNs from ≥1800‑Elo players). This would explain why GPT‑3.5‑turbo‑instruct is much better at chess than open models and newer chat‑tuned models.
  • Thread consensus leans away from “hidden external engine” cheating and toward “biased training data” and possible regressions from instruction‑tuning/RLHF.
  • Some suggest explicit chess RL, specialized adapters, or domain‑specific sub‑models, but this remains speculative.

Prompting, Evaluation & Alternatives

  • Re‑sending the full move list in the prompt each turn often boosts performance; adding legal‑move lists or extra constraints sometimes degrades it.
  • Suggestions include:
    • Asking for analysis/plan before a move (chain‑of‑thought style).
    • Using structured tags/XML, ASCII boards, or explicit board descriptions.
    • Testing on random legal positions, weird puzzles, or rule‑changed variants to probe generalization.
  • Others propose combining LLMs with rule checkers or constrained decoding so the model “thinks” about moves but a chess engine enforces legality.
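The "re‑send the full move list each turn" trick can be sketched as a simple prompt builder. This is a hypothetical helper (no LLM client shown, and the exact prompt shape that works best is an assumption), formatting SAN moves in PGN style and ending at the point where the model should continue:

```python
def build_prompt(moves: list[str]) -> str:
    """Format a SAN move list as a numbered PGN-style prefix, ending
    where the model should produce the next move (hypothetical shape)."""
    parts = []
    for i, move in enumerate(moves):
        if i % 2 == 0:                      # White's move: start a new pair
            parts.append(f"{i // 2 + 1}. {move}")
        else:                               # Black's move: complete the pair
            parts.append(move)
    prompt = " ".join(parts)
    # If White is to move next, open the following move number so the
    # model continues with White's move.
    if len(moves) % 2 == 0:
        prompt += f" {len(moves) // 2 + 1}."
    return prompt.strip()

print(build_prompt(["e4", "e5", "Nf3", "Nc6"]))  # 1. e4 e5 2. Nf3 Nc6 3.
```

The full history goes into every request, so the model never has to reconstruct the position from a partial context.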
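The "LLM proposes, engine enforces legality" pattern might look like this minimal stdlib sketch. In practice the legal‑move set would come from a chess library (e.g., python-chess's `board.legal_moves`) and `propose` would wrap an LLM call; both are stubbed here, and the retry/fallback policy is an assumption:

```python
import random

def choose_legal_move(propose, legal_moves, retries=3):
    """Ask the model (propose: callable returning a SAN string) for a
    move, retrying a few times; fall back to a random legal move if it
    keeps proposing illegal ones, so an illegal move can never be played."""
    legal = set(legal_moves)
    for _ in range(retries):
        candidate = propose()
        if candidate in legal:
            return candidate
    return random.choice(sorted(legal))  # engine-enforced fallback

# Toy stand-in for an LLM that first hallucinates an impossible move.
suggestions = iter(["Qz9", "Nf3"])
move = choose_legal_move(lambda: next(suggestions), ["e4", "d4", "Nf3", "c4"])
print(move)  # Nf3
```

Constrained decoding goes further by masking illegal continuations during generation, but this wrapper approach needs no access to model internals.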

Trust, Hype, and Appropriate Standards

  • Several comments note a growing mistrust of AI vendors and a tendency to over‑ or under‑claim “reasoning.”
  • Some argue LLMs should be judged against average human competence and used as fallible tools; others insist that for formal domains (chess, medicine, safety‑critical code) even low error rates and opaque failure modes are unacceptable without strict safeguards.