OK, I can partly explain the LLM chess weirdness now
LLM Chess Ability & Illegal Moves
- Many note that GPT‑3.5‑turbo‑instruct plays at roughly a 1750–1800 Lichess level and rarely makes illegal moves, which is non‑trivial given that it only sees text move lists, never an explicit board state.
- Others push back: even a single illegal move in a purely rules‑based game shows it hasn’t fully internalized the rules, especially compared to human players of similar rating who almost never make illegal moves with a visible board.
- Several distinguish “chessy” illegal moves (subtle king‑in‑check issues) from blatantly impossible ones (moving non‑existent pieces or off‑board squares); skeptics say current LLMs still do the latter.
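The "chessy vs. blatantly impossible" distinction above can be made partly mechanical. The sketch below (function names are my own, not from the thread) catches only the blatant category: malformed move strings, bad piece letters, and off‑board squares. The subtle "chessy" cases, like moving into check, require a full rules engine and can't be caught by syntax alone.

```python
import re

# Standard Algebraic Notation, loosely: castling, or an optional piece
# letter (K/Q/R/B/N), optional disambiguation file/rank, optional capture
# marker, a destination square a1-h8, optional promotion, optional check/mate.
SAN_RE = re.compile(
    r"^(O-O(-O)?|[KQRBN]?[a-h]?[1-8]?x?[a-h][1-8](=[QRBN])?)[+#]?$"
)

def blatantly_impossible(san: str) -> bool:
    """True if a move string could never be legal in ANY position:
    malformed SAN, a non-existent piece letter, or an off-board square.
    A False result does NOT mean the move is legal in a given position --
    that check needs a real chess engine."""
    return SAN_RE.match(san) is None

print(blatantly_impossible("Nf3"))  # False: syntactically plausible
print(blatantly_impossible("Qi9"))  # True: square i9 is off the board
print(blatantly_impossible("O-O"))  # False: castling is well-formed SAN
```

A move like `Ke2` passes this filter yet may still be illegal in context (king blocked or moving into check), which is exactly the "chessy" failure class the thread distinguishes.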
Understanding, Reasoning, and “Just Token Prediction”
- One group argues that effective next‑token prediction over complex domains (chess, math, code, word problems) is a form of reasoning and world‑modeling, even if implemented via statistics.
- Skeptics counter that:
- Models fail badly on modified or unfamiliar tasks, or when irrelevant info is added.
- They hallucinate reasoning steps post‑hoc and can regurgitate training data.
- “Appearing to reason” is not the same as reliable, systematic reasoning.
- Debate extends into Turing‑completeness: some say being Turing‑complete means, in principle, LLMs could implement reasoning; others call that practically irrelevant given efficiency and reliability constraints.
Training Data, Fine‑Tuning & Model Differences
- A widely supported hypothesis: OpenAI trained some base models on many high‑quality chess games (e.g., PGNs filtered to Elo ≥1800), which would explain why GPT‑3.5‑turbo‑instruct is much better at chess than open models and newer chat‑tuned models.
- Thread consensus leans away from “hidden external engine” cheating and toward “biased training data” and possible regressions from instruction‑tuning/RLHF.
- Some suggest explicit chess RL, specialized adapters, or domain‑specific sub‑models, but this remains speculative.
Prompting, Evaluation & Alternatives
- Restating the full move list in the prompt each turn often boosts performance; counterintuitively, supplying lists of legal moves or extra constraints sometimes degrades it.
- Suggestions include:
- Asking for analysis/plan before a move (chain‑of‑thought style).
- Using structured tags/XML, ASCII boards, or explicit board descriptions.
- Testing on random legal positions, weird puzzles, or rule‑changed variants to probe generalization.
- Others propose combining LLMs with rule checkers or constrained decoding so the model “thinks” about moves but a chess engine enforces legality.
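The last bullet's "LLM proposes, engine enforces" idea can be sketched as a resample‑and‑fallback loop. Everything here is a stub: `propose_move` stands in for an LLM call (prompted, per the earlier bullet, with the full move list so far) and `legal_moves` for a real engine's move generator; neither name comes from the thread.

```python
import random

def constrained_move(propose_move, legal_moves, history, retries=3):
    """Ask the model for a move, but let a rule checker enforce legality.

    propose_move(history) -> SAN string (stand-in for an LLM call)
    legal_moves(history)  -> set of legal SAN moves (stand-in for an engine)
    """
    legal = legal_moves(history)
    for _ in range(retries):
        candidate = propose_move(history)
        if candidate in legal:
            return candidate  # model's move passes the referee
    # Model kept proposing illegal moves: fall back to any legal move,
    # so the game can never be derailed by a hallucinated move.
    return random.choice(sorted(legal))

# Stub demo: a "model" that sometimes hallucinates an off-board move.
fake_legal = lambda history: {"e4", "d4", "Nf3", "c4"}
fake_model = lambda history: random.choice(["e4", "Qi9"])

move = constrained_move(fake_model, fake_legal, history=[])
print(move)  # always drawn from the legal set
```

Constrained decoding goes one step further by masking illegal continuations during generation rather than resampling afterwards, but the guarantee is the same: the engine, not the model, is the source of truth on legality.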
Trust, Hype, and Appropriate Standards
- Several comments note growing mistrust of AI vendors and a tendency to either over‑ or under‑claim "reasoning" ability.
- Some argue LLMs should be judged against average human competence and used as fallible tools; others insist that for formal domains (chess, medicine, safety‑critical code) even low error rates and opaque failure modes are unacceptable without strict safeguards.