OK, I can partly explain the LLM chess weirdness now

LLM Chess Ability & Illegal Moves

  • Many note that GPT‑3.5‑turbo‑instruct plays at roughly Lichess 1750–1800 level and rarely makes illegal moves, which is non‑trivial given that it sees only a text move list, not the board state.
  • Others push back: even a single illegal move in a purely rules‑based game shows the model hasn’t fully internalized the rules, especially since human players of similar rating, playing with a visible board, almost never make one.
  • Several distinguish “chessy” illegal moves (subtle king‑in‑check issues) from blatantly impossible ones (moving non‑existent pieces or off‑board squares); skeptics say current LLMs still do the latter.

Understanding, Reasoning, and “Just Token Prediction”

  • One group argues that effective next‑token prediction over complex domains (chess, math, code, word problems) is a form of reasoning and world‑modeling, even if implemented via statistics.
  • Skeptics counter that:
    • Models fail badly on modified or unfamiliar tasks, or when irrelevant info is added.
    • They hallucinate reasoning steps post‑hoc and can regurgitate training data.
    • “Appearing to reason” is not the same as reliable, systematic reasoning.
  • Debate extends into Turing‑completeness: some say being Turing‑complete means, in principle, LLMs could implement reasoning; others call that practically irrelevant given efficiency and reliability constraints.

Training Data, Fine‑Tuning & Model Differences

  • A widely supported hypothesis: OpenAI trained some base models on many high‑quality chess games (e.g., PGNs from ≥1800‑Elo players). This would explain why GPT‑3.5‑turbo‑instruct is much better at chess than open models and newer chat‑tuned models.
  • Thread consensus leans away from “hidden external engine” cheating and toward “biased training data” and possible regressions from instruction‑tuning/RLHF.
  • Some suggest explicit chess RL, specialized adapters, or domain‑specific sub‑models, but this remains speculative.

Prompting, Evaluation & Alternatives

  • Re‑sending the full move list in the prompt each turn often boosts performance; adding legal‑move lists or extra constraints sometimes degrades it.
  • Suggestions include:
    • Asking for analysis/plan before a move (chain‑of‑thought style).
    • Using structured tags/XML, ASCII boards, or explicit board descriptions.
    • Testing on random legal positions, weird puzzles, or rule‑changed variants to probe generalization.
  • Others propose combining LLMs with rule checkers or constrained decoding so the model “thinks” about moves but a chess engine enforces legality.
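The "re‑send the full move list each turn" trick can be sketched as a simple prompt builder. This is a hypothetical helper (no LLM client shown, and the exact prompt shape that works best is an assumption), formatting SAN moves in PGN style and ending at the point where the model should continue:

```python
def build_prompt(moves: list[str]) -> str:
    """Format a SAN move list as a numbered PGN-style prefix, ending
    where the model should produce the next move (hypothetical shape)."""
    parts = []
    for i, move in enumerate(moves):
        if i % 2 == 0:                      # White's move: start a new pair
            parts.append(f"{i // 2 + 1}. {move}")
        else:                               # Black's move: complete the pair
            parts.append(move)
    prompt = " ".join(parts)
    # If White is to move next, open the following move number so the
    # model continues with White's move.
    if len(moves) % 2 == 0:
        prompt += f" {len(moves) // 2 + 1}."
    return prompt.strip()

print(build_prompt(["e4", "e5", "Nf3", "Nc6"]))  # 1. e4 e5 2. Nf3 Nc6 3.
```

The full history goes into every request, so the model never has to reconstruct the position from a partial context.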
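The "LLM proposes, engine enforces legality" pattern might look like this minimal stdlib sketch. In practice the legal‑move set would come from a chess library (e.g., python-chess's `board.legal_moves`) and `propose` would wrap an LLM call; both are stubbed here, and the retry/fallback policy is an assumption:

```python
import random

def choose_legal_move(propose, legal_moves, retries=3):
    """Ask the model (propose: callable returning a SAN string) for a
    move, retrying a few times; fall back to a random legal move if it
    keeps proposing illegal ones, so an illegal move can never be played."""
    legal = set(legal_moves)
    for _ in range(retries):
        candidate = propose()
        if candidate in legal:
            return candidate
    return random.choice(sorted(legal))  # engine-enforced fallback

# Toy stand-in for an LLM that first hallucinates an impossible move.
suggestions = iter(["Qz9", "Nf3"])
move = choose_legal_move(lambda: next(suggestions), ["e4", "d4", "Nf3", "c4"])
print(move)  # Nf3
```

Constrained decoding goes further by masking illegal continuations during generation, but this wrapper approach needs no access to model internals.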

Trust, Hype, and Appropriate Standards

  • Several comments note a growing mistrust of AI vendors and a tendency to over‑ or under‑claim “reasoning.”
  • Some argue LLMs should be judged against average human competence and used as fallible tools; others insist that for formal domains (chess, medicine, safety‑critical code) even low error rates and opaque failure modes are unacceptable without strict safeguards.