LLMs, Theory of Mind, and Cheryl’s Birthday
LLMs on Cheryl’s Birthday and Similar Puzzles
- Several commenters report that newer models (e.g., “mini” code-optimized variants, o1-preview, Claude 3.5) can generate correct Python solvers, sometimes on the first try.
- Others note that earlier or less capable models either fail outright, produce empty solutions, or require iterative debugging with user feedback.
- Some stress that the real challenge is writing generic constraint-solving code from the verbal description, not hardcoding the known solution (a minimal solver sketch follows this list).
- A concern is that code and solutions for this exact puzzle are widely available online (e.g., Rosetta Code), so success may come from retrieval/memorization rather than genuine reasoning.
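For reference, a minimal Python sketch of the kind of solver commenters describe (not any specific commenter’s code) might look like the following. It encodes the puzzle’s three public statements as successive filters over the candidate dates, which is also where the knowledge-state modeling discussed below comes in: each filter expresses what one agent can infer from the other’s announcement.

```python
from collections import Counter

# The ten candidate dates from the original puzzle statement.
DATES = [
    ("May", 15), ("May", 16), ("May", 19),
    ("June", 17), ("June", 18),
    ("July", 14), ("July", 16),
    ("August", 14), ("August", 15), ("August", 17),
]

def unique(key_index, dates):
    """Return the dates whose month (0) or day (1) is unique in `dates`."""
    counts = Counter(d[key_index] for d in dates)
    return [d for d in dates if counts[d[key_index]] == 1]

def solve(dates):
    # Statement 1: Albert (who knows the month) is sure Bernard (who
    # knows the day) doesn't know. So Albert's month cannot contain any
    # day that is unique across all candidates.
    unique_days = {d[1] for d in unique(1, dates)}
    bad_months = {d[0] for d in dates if d[1] in unique_days}
    dates = [d for d in dates if d[0] not in bad_months]

    # Statement 2: Bernard now knows, so among the remaining dates
    # his day must be unique.
    dates = unique(1, dates)

    # Statement 3: Albert now knows too, so his month must be unique
    # among what's left.
    dates = unique(0, dates)
    return dates

print(solve(DATES))  # [('July', 16)]
```

The point several commenters make is that a model should produce code in this generic shape from the prose alone; a solver that simply returns the string "July 16" proves nothing.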
Reasoning, Memorization, and “Theory of Mind”
- One camp argues the puzzle is mainly a logic/constraint-satisfaction task and not a good test of theory of mind (ToM); even simple logic programs can solve it.
- Others counter that the puzzle does involve modeling different agents’ knowledge states, which is at least ToM-adjacent.
- Multiple comments highlight that LLMs often give the correct answer alongside logically inconsistent or incorrect explanations, which is interpreted as evidence of memorization rather than reasoning.
- There’s pushback against using a puzzle many humans fail as a ToM litmus test: failure doesn’t imply absence of ToM in humans or machines.
Benchmark Spoiling and Evaluation Methodology
- Commenters note that once a “bellwether” puzzle becomes famous, future models may be specifically trained or RL-tuned on it, making it useless as a reasoning benchmark.
- Some describe designing new river-crossing variants and other riddles; LLMs tend to handle canonical versions but break on subtle twists or extra irrelevant constraints.
- There’s discussion of randomness, prompt sensitivity, and the need for multi-sample evaluation rather than single anecdotes (see the harness sketch after this list).
- Others emphasize that models are fundamentally text predictors; adding interpreters or external tools improves reliability but also reveals their pattern-following nature.
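The multi-sample point can be made concrete with a small evaluation harness. In this sketch, `ask_model` is a hypothetical stand-in for whatever LLM API an evaluator would actually call, and its 70% hit rate is invented purely so the example runs:

```python
import random  # used only to simulate a model; a real harness would call an LLM API

def ask_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; swap in a real API client."""
    # Simulate a stochastic model that answers correctly ~70% of the time.
    return "July 16" if random.random() < 0.7 else "August 17"

def solve_rate(prompt: str, expected: str, n: int = 50) -> float:
    """Estimate the fraction of n independent samples matching `expected`."""
    hits = sum(ask_model(prompt) == expected for _ in range(n))
    return hits / n

if __name__ == "__main__":
    rate = solve_rate("When is Cheryl's birthday?", "July 16")
    print(f"solve rate over 50 samples: {rate:.0%}")
```

A single success or failure says little about a stochastic system; reporting a rate over many samples (and over prompt variants) is the minimum the commenters are asking for.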
Broader Views on LLM Capabilities
- Enthusiastic voices claim modern LLMs are already “smarter than the average human” on many practical cognitive tasks, and that the goalposts for “AI” keep moving.
- Skeptical voices argue current architectures hit reasoning/generalization limits (e.g., beyond roughly linear-complexity tasks), lack robust world models, and are overhyped as “intelligent.”
- Several suggest treating LLMs as powerful but non-reasoning tools—akin to calculators or spreadsheets—rather than minds, while still recognizing their transformative practical impact.