LLMs, Theory of Mind, and Cheryl's Birthday

LLMs on Cheryl’s Birthday and Similar Puzzles

  • Several commenters report that newer models (e.g., “mini” code-optimized variants, o1-preview, Claude 3.5) can generate correct Python solvers, sometimes on the first try.
  • Others note earlier or different models either fail outright, produce empty solutions, or require iterative debugging with user feedback.
  • Some stress that the real challenge is writing generic constraint-solving code from the verbal description, not hardcoding the known solution.
  • A concern is that code and solutions for this exact puzzle are widely available online (e.g., Rosetta Code), so success may come from retrieval/memorization rather than genuine reasoning.
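The "generic constraint-solving" point above is concrete enough to sketch. Below is a minimal Python solver, written as successive knowledge-state filters rather than hardcoded eliminations, using the standard date list from the 2015 Singapore puzzle. This is an illustrative sketch of the approach commenters describe, not any particular commenter's code.

```python
# Cheryl's Birthday as successive knowledge-state filters.
# Albert is told the month, Bernard the day; each public statement
# shrinks the common candidate set.

DATES = [("May", 15), ("May", 16), ("May", 19),
         ("June", 17), ("June", 18),
         ("July", 14), ("July", 16),
         ("August", 14), ("August", 15), ("August", 17)]

def knows(candidates, key):
    """An agent told key(date) knows the date iff exactly one
    candidate shares that key value."""
    return lambda date: sum(1 for d in candidates if key(d) == key(date)) == 1

def month(d): return d[0]
def day(d):   return d[1]

# Statement 1: Albert doesn't know, and knows Bernard doesn't either.
# For Albert to know that, no date in his month may have a unique day.
bernard_knows_0 = knows(DATES, day)
s1 = [d for d in DATES
      if not knows(DATES, month)(d)
      and all(not bernard_knows_0(e) for e in DATES if month(e) == month(d))]

# Statement 2: Bernard, having heard statement 1, now knows.
s2 = [d for d in s1 if knows(s1, day)(d)]

# Statement 3: Albert, having heard statement 2, now knows.
s3 = [d for d in s2 if knows(s2, month)(d)]

print(s3)  # [('July', 16)]
```

Because the filters only reference "what would an agent told X know at this point," the same code works on any date list, which is exactly the generality commenters say distinguishes reasoning from retrieving the memorized July 16 answer.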

Reasoning, Memorization, and “Theory of Mind”

  • One camp argues the puzzle is mainly a logic/constraint-satisfaction task and not a good test of theory of mind (ToM); even simple logic programs can solve it.
  • Others counter that the puzzle does involve modeling different agents’ knowledge states, which is at least ToM-adjacent.
  • Multiple comments highlight that LLMs often produce the correct final answer accompanied by logically inconsistent or invalid explanations, which is interpreted as evidence of memorization over reasoning.
  • There’s pushback against using a puzzle many humans fail as a ToM litmus test: failure doesn’t imply absence of ToM in humans or machines.

Benchmark Spoiling and Evaluation Methodology

  • Commenters note that once a “bellwether” puzzle becomes famous, future models may be specifically trained or RL-tuned on it, making it useless as a reasoning benchmark.
  • Some describe designing new river-crossing variants and other riddles; LLMs tend to handle canonical versions but break on subtle twists or extra irrelevant constraints.
  • There’s discussion of randomness, prompt sensitivity, and the need for multi-sample evaluation rather than single anecdotes.
  • Others emphasize that models are fundamentally text predictors; adding interpreters or external tools improves reliability but also reveals their pattern-following nature.
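The multi-sample evaluation point can be made concrete. One common approach (not tied to any specific commenter) is to draw n samples per puzzle and report an unbiased pass@k estimate, using the estimator popularized by code-generation benchmarks such as HumanEval:

```python
# Unbiased pass@k: the probability that at least one of k draws
# (without replacement) from n samples, of which c are correct,
# contains a correct solution.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:          # fewer incorrect samples than draws: success is certain
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 3 correct solutions out of 10 samples, judged on a single draw:
print(round(pass_at_k(10, 3, 1), 3))  # 0.3
```

Reporting pass@1 over many samples, rather than a single anecdote, directly addresses the randomness and prompt-sensitivity concerns raised above.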

Broader Views on LLM Capabilities

  • Enthusiastic voices claim modern LLMs are already “smarter than the average human” on many practical cognitive tasks, and goalposts for “AI” keep moving.
  • Skeptical voices argue current architectures hit reasoning/generalization limits (e.g., degrading once tasks exceed roughly linear complexity), lack robust world models, and are overhyped as “intelligent.”
  • Several suggest treating LLMs as powerful but non-reasoning tools—akin to calculators or spreadsheets—rather than minds, while still recognizing their transformative practical impact.