2024-06-05

Simple tasks showing reasoning breakdown in state-of-the-art LLMs

Core puzzle and results

The thread centers on the “Alice has N brothers and M sisters; how many sisters does Alice’s brother have?” puzzle from the paper. The correct answer is M + 1, because Alice is herself one of her brother’s sisters.
Key finding discussed: GPT‑4o and similar models often answer incorrectly, especially when forced to output only the final number; success rates of roughly 60% are mentioned.
A harder “AIW+” family‑relations variant is acknowledged as non‑trivial even for humans.
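For reference, the arithmetic behind the basic puzzle can be written out in a few lines. This is only an illustrative sketch: the function name and the input checks are assumptions made here, not something taken from the paper or the thread.

```python
def sisters_of_alices_brother(num_brothers: int, num_sisters: int) -> int:
    """Return how many sisters each of Alice's brothers has.

    Alice is a girl, so each brother has all of Alice's sisters
    plus Alice herself. The number of brothers does not affect the
    answer; it only has to be at least 1 for the question to make sense.
    """
    if num_brothers < 1:
        raise ValueError("Alice needs at least one brother")
    if num_sisters < 0:
        raise ValueError("number of sisters cannot be negative")
    return num_sisters + 1


# Example: Alice has 3 brothers and 2 sisters.
print(sisters_of_alices_brother(3, 2))  # prints 3
```

The common wrong answer is M, which is what you get by forgetting to count Alice.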

Prompting, chain-of-thought, and constraints

Many commenters observe that models do better when allowed to “think out loud” or show step‑by‑step reasoning.
The paper’s “RESTRICTED” prompt (answer-only, no reasoning) is criticized by some as artificially limiting computation; others note the paper also tested more generous prompts.
Several people argue that if small changes in phrasing or format break reasoning, that’s itself a reliability problem for real-world use.
Experiments show models can sometimes correct themselves when explicitly asked to reconsider or to reason about possible inconsistencies.
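The difference between an answer-only prompt and a step-by-step prompt can be sketched as a small scoring harness. Everything here is hypothetical: the prompt wordings are not the paper's exact text, the function names are invented for illustration, and no real model API is called. The replies are plain strings that would come from whatever model is being tested.

```python
import re
from typing import Optional


def build_prompt(n_brothers: int, m_sisters: int, mode: str) -> str:
    """Build a puzzle prompt in one of two illustrative styles."""
    question = (
        f"Alice has {n_brothers} brothers and {m_sisters} sisters. "
        "How many sisters does Alice's brother have?"
    )
    if mode == "restricted":
        # Answer-only: no room for intermediate reasoning.
        return question + " Reply with only the final number."
    if mode == "thinking":
        # Step-by-step: the model may reason before answering.
        return question + " Think step by step, then state the final number."
    raise ValueError(f"unknown mode: {mode}")


def extract_final_number(reply: str) -> Optional[int]:
    """Take the last integer in the reply as the model's final answer."""
    numbers = re.findall(r"-?\d+", reply)
    return int(numbers[-1]) if numbers else None


def success_rate(replies: list[str], expected: int) -> float:
    """Fraction of replies whose final number equals the expected answer."""
    if not replies:
        return 0.0
    correct = sum(1 for r in replies if extract_final_number(r) == expected)
    return correct / len(replies)
```

Running the same question many times under each mode and comparing the two success rates is one simple way to check the prompt sensitivity the commenters describe.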

Reasoning vs pattern matching

One camp asserts LLMs are essentially sophisticated statistical parrots or “semantic compression machines,” not genuine reasoners, and this puzzle exposes that.
Another camp claims that next‑token prediction over massive textual corpora effectively forces models to simulate some forms of human reasoning, even if the mechanism is very different.
There is extended debate over what “reasoning” even means, and whether looking only at the mechanics of transformers is sufficient to declare the absence of reasoning.

Human comparison and benchmarks

Multiple commenters question the claim that the puzzle is “simple,” suggesting many non‑technical humans would also fail without careful thought.
Others want empirical human baselines, not assumptions, and note that people also confabulate confident but wrong explanations.
Benchmarks like MMLU are criticized as contaminated by training data and weak on genuine reasoning; calls are made for fresh, truly out‑of‑distribution tests.

Augmenting LLMs and future directions

Several commenters propose coupling LLMs to formal reasoning tools (Prolog, theorem provers, code execution), where the model translates text into logic or programs and delegates the exact reasoning.
Anecdotes show GPT variants can write Prolog encodings of the family puzzle and then get the correct result via execution, suggesting a hybrid path forward.
Broader meta‑discussion: LLMs are powerful and useful but unreliable; hype about near‑term AGI should be tempered by systematic demonstrations of such failures.
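The hybrid idea mentioned above (translate the puzzle into an explicit model, then let a program do the counting) can be illustrated with a minimal sketch. It uses Python as a stand-in for Prolog and is not the encoding any commenter posted; the `Person` class and the function names are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    name: str
    sex: str  # "F" or "M"


def build_family(n_brothers: int, m_sisters: int) -> list[Person]:
    """List every sibling explicitly, including Alice herself."""
    family = [Person("Alice", "F")]
    family += [Person(f"brother{i}", "M") for i in range(1, n_brothers + 1)]
    family += [Person(f"sister{i}", "F") for i in range(1, m_sisters + 1)]
    return family


def sisters_of(person: Person, family: list[Person]) -> list[Person]:
    """All female siblings of `person`, excluding the person themselves."""
    return [p for p in family if p.sex == "F" and p != person]


def answer_puzzle(n_brothers: int, m_sisters: int) -> int:
    """Count the sisters of one of Alice's brothers by enumeration."""
    family = build_family(n_brothers, m_sisters)
    brothers = [p for p in family if p.sex == "M"]
    if not brothers:
        raise ValueError("Alice needs at least one brother")
    return len(sisters_of(brothers[0], family))


print(answer_puzzle(3, 2))  # prints 3
```

Because the siblings are enumerated rather than inferred, Alice cannot be forgotten: she is in the list, so she is counted. The remaining risk shifts to the translation step, that is, whether the model builds the right explicit representation in the first place.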
