The Illusion of Thinking: Strengths and limitations of reasoning models [pdf]

What “reasoning” means and what LRMs really are

  • Many commenters argue that “large reasoning models” are just LLMs with extra steps: more context, chain-of-thought, self-refinement, RLHF on problem-solving traces.
  • Disagreement over definitions: some insist on a formal notion of deriving new facts from old ones (modus ponens, generalizable algorithms); others argue that pattern matching plus heuristics can be functionally indistinguishable from reasoning in practice.
  • Several note that current “reasoning” is often just a branded version of long-known prompt-engineering tricks; the name oversells what’s actually happening.

Core experimental findings from the paper

  • Puzzles are used because they avoid training-data contamination, allow fine-grained control over complexity, and force explicit logical structure.
  • Three regimes emerge:
    • Low complexity: vanilla LLMs often outperform LRMs; extra “thinking” leads to overcomplication and worse answers.
    • Medium complexity: LRMs do better, but only if allowed many tokens; gains are expensive.
    • High complexity: both LLMs and LRMs collapse to ~0% accuracy; past a threshold, LRMs even spend fewer reasoning tokens as complexity rises, despite having budget left.
  • Even when the exact solution algorithm is given in the prompt, LRMs still need many steps and often fail to execute it consistently (see the Tower of Hanoi sketch after this list).
  • Models appear to handle some tasks (e.g., large Tower of Hanoi instances) via memorized move patterns, yet fail much earlier on similarly structured but less familiar puzzles (e.g., river-crossing variants).
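
For context on the last two points: the explicit procedure in question is, in the Tower of Hanoi case, the classic recursive solution, whose minimum move count is 2^n - 1 for n disks. A minimal sketch (peg and function names are illustrative, not the paper's prompt wording):

    def hanoi(n, source, target, spare, moves):
        """Classic recursive Tower of Hanoi: move n disks from source to target."""
        if n == 0:
            return
        hanoi(n - 1, source, spare, target, moves)  # clear the n-1 smaller disks out of the way
        moves.append((source, target))              # move the largest remaining disk
        hanoi(n - 1, spare, target, source, moves)  # restack the smaller disks on top of it

    moves = []
    hanoi(10, "A", "C", "B", moves)
    print(len(moves))  # 1023 == 2**10 - 1: the chain a model must execute without a single slip

Executing every one of those moves correctly is precisely the long-chain regime in which the paper reports accuracy collapsing.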

Implications for AGI and the hype cycle

  • Many see this as evidence of a “complexity wall” that more tokens and compute don’t simply overcome, weakening near-term AGI claims.
  • Comparisons are made to self‑driving cars and fusion: big progress, but generality and robustness stall in long-tail cases.
  • Others remain bullish, viewing this as mapping where current methods break, not a fundamental limit; they expect new architectures, tools, or agents to push the wall back.

Critiques and caveats about the study

  • Some say it mostly measures long-chain adherence, not whether models can invent algorithms; allowing code-writing or tool use would trivialize many of the puzzles (see the search sketch after this list).
  • Others note missing or fuzzy definitions (“reasoning”, “generalizable”) and argue that humans also fail catastrophically beyond small N, yet we still say humans reason.
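
On the "tools would trivialize it" point: a generic breadth-first search over puzzle states dispatches the classic wolf, goat and cabbage river crossing in a couple of dozen lines. The encoding below is an illustration of that critique, not the benchmark variant used in the paper:

    from collections import deque

    # State: the set of items still on the starting bank ("F" is the farmer/boat).
    ITEMS = frozenset({"F", "wolf", "goat", "cabbage"})
    UNSAFE = [{"wolf", "goat"}, {"goat", "cabbage"}]  # pairs that cannot be left without the farmer

    def safe(bank):
        return "F" in bank or not any(pair <= bank for pair in UNSAFE)

    def solve():
        start, goal = ITEMS, frozenset()
        queue, seen = deque([(start, [])]), {start}
        while queue:
            bank, path = queue.popleft()
            if bank == goal:
                return path
            here = bank if "F" in bank else ITEMS - bank  # the bank the farmer is on
            for cargo in [None] + [x for x in here if x != "F"]:
                moved = {"F"} | ({cargo} if cargo else set())
                nxt = bank - moved if "F" in bank else bank | moved
                if safe(nxt) and safe(ITEMS - nxt) and nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, path + [cargo or "alone"]))

    print(solve())  # a shortest crossing sequence, found by brute-force search

A model allowed to write and run such a program sidesteps the long chain of in-context moves entirely, which is why some commenters see the benchmark as measuring execution stamina rather than problem-solving.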

Observed behavior of today’s models & future directions

  • Anecdotes match the paper: “reasoning” models often excel at medium-difficulty tasks but overthink simple ones and derail on complex ones (examples cited include coding, Base64, and strategy questions).
  • Suggested next steps include neurosymbolic hybrids, explicit logic/optimization backends (a solver-offloading sketch follows), agents that decompose problems, more non-linguistic grounding, and better ways to manage or externalize long reasoning chains.
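
One concrete shape the "explicit logic/optimization backend" suggestion often takes is to have the language model translate a problem into constraints and hand the exhaustive search to an off-the-shelf solver. A minimal sketch with the Z3 Python bindings (the z3-solver package) on a classic cryptarithm; the puzzle and encoding are illustrative, not drawn from the paper or the thread:

    from z3 import Int, Solver, Distinct, sat

    # The model's hypothetical job: translate "SEND + MORE = MONEY" into constraints.
    # The solver's job: the exhaustive search that in-context reasoning does unreliably.
    digits = {c: Int(c) for c in "SENDMORY"}
    s = Solver()
    s.add([d >= 0 for d in digits.values()])
    s.add([d <= 9 for d in digits.values()])
    s.add(Distinct(*digits.values()))
    s.add(digits["S"] > 0, digits["M"] > 0)  # no leading zeros

    def word(w):
        value = 0
        for c in w:
            value = value * 10 + digits[c]
        return value

    s.add(word("SEND") + word("MORE") == word("MONEY"))
    if s.check() == sat:
        model = s.model()
        print({c: model[d].as_long() for c, d in digits.items()})

The division of labor mirrors the neurosymbolic proposals in the thread: language handles problem formulation, while a symbolic engine handles the combinatorial part that scales poorly in a token stream.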