LLMs, Theory of Mind, and Cheryl's Birthday

LLMs on Cheryl’s Birthday and Similar Puzzles

  • Several commenters report that newer models (e.g., “mini” code-optimized variants, o1-preview, Claude 3.5) can generate correct Python solvers, sometimes on the first try.
  • Others note earlier or different models either fail outright, produce empty solutions, or require iterative debugging with user feedback.
  • Some stress that the real challenge is writing generic constraint-solving code from the verbal description, not hardcoding the known solution.
  • A concern is that code and solutions for this exact puzzle are widely available online (e.g., Rosetta Code), so success may come from retrieval/memorization rather than genuine reasoning.
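The "generic constraint-solving" point above is concrete enough to sketch. Below is a minimal Python solver, written as successive knowledge-state filters rather than hardcoded eliminations, using the standard date list from the 2015 Singapore puzzle. This is an illustrative sketch of the approach commenters describe, not any particular commenter's code.

```python
# Cheryl's Birthday as successive knowledge-state filters.
# Albert is told the month, Bernard the day; each public statement
# shrinks the common candidate set.

DATES = [("May", 15), ("May", 16), ("May", 19),
         ("June", 17), ("June", 18),
         ("July", 14), ("July", 16),
         ("August", 14), ("August", 15), ("August", 17)]

def knows(candidates, key):
    """An agent told key(date) knows the date iff exactly one
    candidate shares that key value."""
    return lambda date: sum(1 for d in candidates if key(d) == key(date)) == 1

def month(d): return d[0]
def day(d):   return d[1]

# Statement 1: Albert doesn't know, and knows Bernard doesn't either.
# For Albert to know that, no date in his month may have a unique day.
bernard_knows_0 = knows(DATES, day)
s1 = [d for d in DATES
      if not knows(DATES, month)(d)
      and all(not bernard_knows_0(e) for e in DATES if month(e) == month(d))]

# Statement 2: Bernard, having heard statement 1, now knows.
s2 = [d for d in s1 if knows(s1, day)(d)]

# Statement 3: Albert, having heard statement 2, now knows.
s3 = [d for d in s2 if knows(s2, month)(d)]

print(s3)  # [('July', 16)]
```

Because the filters only reference "what would an agent told X know at this point," the same code works on any date list, which is exactly the generality commenters say distinguishes reasoning from retrieving the memorized July 16 answer.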

Reasoning, Memorization, and “Theory of Mind”

  • One camp argues the puzzle is mainly a logic/constraint-satisfaction task and not a good test of theory of mind (ToM); even simple logic programs can solve it.
  • Others counter that the puzzle does involve modeling different agents’ knowledge states, which is at least ToM-adjacent.
  • Multiple comments highlight that LLMs often produce the correct final answer accompanied by logically inconsistent or invalid explanations, which is interpreted as evidence of memorization over reasoning.
  • There’s pushback against using a puzzle many humans fail as a ToM litmus test: failure doesn’t imply absence of ToM in humans or machines.

Benchmark Spoiling and Evaluation Methodology

  • Commenters note that once a “bellwether” puzzle becomes famous, future models may be specifically trained or RL-tuned on it, making it useless as a reasoning benchmark.
  • Some describe designing new river-crossing variants and other riddles; LLMs tend to handle canonical versions but break on subtle twists or extra irrelevant constraints.
  • There’s discussion of randomness, prompt sensitivity, and the need for multi-sample evaluation rather than single anecdotes.
  • Others emphasize that models are fundamentally text predictors; adding interpreters or external tools improves reliability but also reveals their pattern-following nature.
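The multi-sample evaluation point can be made concrete. One common approach (not tied to any specific commenter) is to draw n samples per puzzle and report an unbiased pass@k estimate, using the estimator popularized by code-generation benchmarks such as HumanEval:

```python
# Unbiased pass@k: the probability that at least one of k draws
# (without replacement) from n samples, of which c are correct,
# contains a correct solution.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:          # fewer incorrect samples than draws: success is certain
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 3 correct solutions out of 10 samples, judged on a single draw:
print(round(pass_at_k(10, 3, 1), 3))  # 0.3
```

Reporting pass@1 over many samples, rather than a single anecdote, directly addresses the randomness and prompt-sensitivity concerns raised above.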

Broader Views on LLM Capabilities

  • Enthusiastic voices claim modern LLMs are already “smarter than the average human” on many practical cognitive tasks, and goalposts for “AI” keep moving.
  • Skeptical voices argue current architectures hit reasoning/generalization limits (e.g., degrading once tasks exceed roughly linear complexity), lack robust world models, and are overhyped as “intelligent.”
  • Several suggest treating LLMs as powerful but non-reasoning tools—akin to calculators or spreadsheets—rather than minds, while still recognizing their transformative practical impact.