Simple tasks showing reasoning breakdown in state-of-the-art LLMs

Core puzzle and results

  • Thread centers on the “Alice has N brothers and M sisters; how many sisters does Alice’s brother have?” puzzle from the paper.
  • Key finding discussed: GPT‑4o and similar models often answer incorrectly, especially when forced to output only the final number; success rates of roughly 60% are mentioned.
  • A harder “AIW+” family‑relations variant is acknowledged as non‑trivial even for humans.
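
For reference, the intended arithmetic of the base puzzle can be written down directly. This is a minimal sketch, assuming Alice is female and all siblings share the same parents; the function name is illustrative and does not come from the paper or the thread.

```python
def sisters_of_alices_brother(n_brothers: int, m_sisters: int) -> int:
    """Return how many sisters each of Alice's brothers has.

    Assumes Alice is female and that all siblings share the same parents.
    """
    if n_brothers < 1:
        raise ValueError("Alice needs at least one brother for the question to apply")
    if m_sisters < 0:
        raise ValueError("the number of sisters cannot be negative")
    # A brother's sisters are Alice's M sisters plus Alice herself.
    # The number of brothers N does not affect the answer.
    return m_sisters + 1


print(sisters_of_alices_brother(3, 6))  # 7
```

The only trap is that Alice counts as a sister from her brother's point of view, so the answer is M + 1 regardless of N.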

Prompting, chain-of-thought, and constraints

  • Many commenters observe that models do better when allowed to “think out loud” or show step‑by‑step reasoning.
  • The paper’s “RESTRICTED” prompt (answer-only, no reasoning) is criticized by some as artificially limiting computation; others note the paper also tested more generous prompts.
  • Several people argue that if small changes in phrasing or format break reasoning, that’s itself a reliability problem for real-world use.
  • Experiments reported in the thread show that models can sometimes correct themselves when explicitly asked to reconsider or to reason about possible inconsistencies.
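
The answer-only versus step-by-step comparison discussed above could be set up roughly as follows. This is a sketch under stated assumptions: the prompt wording is hypothetical and not the paper's exact "RESTRICTED" text, and `ask` stands in for whatever function sends a prompt to a model and returns its reply.

```python
import re
from typing import Callable, Optional

PUZZLE = (
    "Alice has {n} brothers and she also has {m} sisters. "
    "How many sisters does Alice's brother have?"
)

# Hypothetical prompt suffixes, not the paper's exact wording.
PROMPT_STYLES = {
    "answer_only": " Reply with the final number only.",
    "step_by_step": " Think step by step, then end with 'Answer: <number>'.",
}


def build_prompt(n: int, m: int, style: str) -> str:
    """Combine the puzzle text with one of the prompt styles."""
    return PUZZLE.format(n=n, m=m) + PROMPT_STYLES[style]


def extract_answer(response: str) -> Optional[int]:
    """Treat the last integer in the response as the model's final answer."""
    numbers = re.findall(r"-?\d+", response)
    return int(numbers[-1]) if numbers else None


def success_rate(ask: Callable[[str], str], n: int, m: int,
                 style: str, trials: int = 10) -> float:
    """Fraction of trials in which the extracted answer equals M + 1."""
    expected = m + 1
    prompt = build_prompt(n, m, style)
    correct = sum(extract_answer(ask(prompt)) == expected for _ in range(trials))
    return correct / trials


def echo_m_stub(prompt: str) -> str:
    """Stand-in for a model call; it always replies '4'."""
    return "4"


print(success_rate(echo_m_stub, n=3, m=4, style="answer_only"))  # 0.0
```

Swapping the stub for a real model call and running both styles over several (N, M) pairs would give the kind of per-prompt success rates the thread argues about.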

Reasoning vs pattern matching

  • One camp asserts LLMs are essentially sophisticated statistical parrots or “semantic compression machines,” not genuine reasoners, and this puzzle exposes that.
  • Another camp claims that next‑token prediction over massive textual corpora effectively forces models to simulate some forms of human reasoning, even if the mechanism is very different.
  • There is extended debate over what “reasoning” even means, and whether looking only at the mechanics of transformers is sufficient to declare the absence of reasoning.

Human comparison and benchmarks

  • Multiple commenters question the claim that the puzzle is “simple,” suggesting many non‑technical humans would also fail without careful thought.
  • Others want empirical human baselines, not assumptions, and note that people also confabulate confident but wrong explanations.
  • Benchmarks like MMLU are criticized as contaminated by training data and weak on genuine reasoning; commenters call for fresh, truly out‑of‑distribution tests.

Augmenting LLMs and future directions

  • Several propose coupling LLMs to formal reasoning tools (Prolog, theorem provers, code execution) where the model translates text into logic/programs and delegates exact reasoning.
  • Anecdotes show GPT variants can write Prolog encodings of the family puzzle and then get the correct result via execution, suggesting a hybrid path forward.
  • Broader meta‑discussion: LLMs are powerful and useful but unreliable; hype about near‑term AGI should be tempered by systematic demonstrations of such failures.
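
The hybrid idea of having the model emit facts and delegating the exact reasoning to a deterministic tool can be illustrated in Python as a stand-in for the Prolog encodings mentioned in the thread. This is only a sketch: the `Person` type, the naming scheme, and the assumption that Alice is female and all siblings share parents are illustrative choices, not taken from any commenter's code.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Person:
    name: str
    sex: str  # "f" or "m"


def build_family(n_brothers: int, m_sisters: int) -> List[Person]:
    """Facts a model would be asked to emit: Alice plus her siblings."""
    family = [Person("alice", "f")]
    family += [Person(f"brother_{i}", "m") for i in range(1, n_brothers + 1)]
    family += [Person(f"sister_{i}", "f") for i in range(1, m_sisters + 1)]
    return family


def sisters_of(person: Person, family: List[Person]) -> List[Person]:
    """Rule: X is a sister of P if X is female, a sibling, and not P."""
    return [p for p in family if p.sex == "f" and p != person]


family = build_family(n_brothers=2, m_sisters=3)
brother = next(p for p in family if p.sex == "m")
print(len(sisters_of(brother, family)))  # 4
```

Once the facts are explicit, the query is mechanical, which is the point commenters make: the model's job shrinks to translating the text correctly, and the counting is delegated to code that cannot forget Alice.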