Chain-of-thought can hurt performance on tasks where thinking makes humans worse

Where Chain-of-Thought (CoT) Helps vs Hurts

  • CoT is widely reported, and confirmed in the paper, to improve performance on many tasks, especially complex reasoning and code generation.
  • The new result: on some tasks—implicit statistical learning, visual recognition, and pattern classification with exceptions—forcing step-by-step reasoning can significantly reduce accuracy.
  • Some commenters liken this to “don’t overthink it”: for tasks optimized for fast pattern recognition, serial verbal reasoning can interfere with strong implicit representations.
  • Others note that CoT also adds major inference cost, undermining the “train once, cheap inference forever” promise.

Human Cognition Parallels (Overthinking, Muscle Memory)

  • Many draw analogies to humans:
    • Motor skills (catching a ball, shooting pool, athletic technique) get worse when you consciously micromanage movements instead of relying on muscle memory / implicit learning.
    • Flow states vs self-conscious analysis in athletics and creativity.
    • Grammar judgments and password recall degrade when you try to verbalize each step.
  • These are framed as evidence that explicit reasoning can disrupt optimized implicit processes in both brains and models.

Do LLMs ‘Reason’ or Just Predict Tokens?

  • One camp: LLMs are “just” next-token predictors / compressed internet; CoT cannot create information that isn’t there, only rephrases patterns.
  • Opposing camp: next-token prediction doesn’t preclude reasoning; humans may also be sophisticated predictors. Good performance on math, code, and logic-like tasks is cited as evidence of emergent reasoning.
  • Disagreement over whether mathematical or scientific breakthroughs are necessary to count as “real reasoning,” and whether current models are fundamentally incapable of such leaps.

World Models, Semantics, and Plato’s Cave

  • Some argue LLMs lack true world models, ontology, and grounded semantics; they manipulate symbols without experiential contact with reality.
  • Others cite research suggesting internal “world-like” representations (e.g., in games, demographics, physics-like structure) emerge because they improve prediction.
  • A recurring metaphor: LLMs operate on “word models,” akin to prisoners in Plato’s Cave inferring the world from shadows (text), not direct experience.

AGI Prospects and Local Maxima

  • Several commenters see LLMs as a local maximum, not a path to AGI: no persistent memory, no embodied world modeling, heavy dependence on static training.
  • Others think LLMs (or LLM-like components) will remain central building blocks of more general systems, especially when combined with tools, memory, and multimodal input.
  • The thread reflects strong skepticism about AGI timelines, but also recognition of LLMs’ surprising versatility and economic value.

Benchmarks, Robustness, and Reproducibility

  • Some stress that CoT often empirically improves code and math evals; the paper’s negative results are framed as task-specific, not universal.
  • Others criticize LLM research for small or opaque datasets, lack of released code/data, and sensitivity to minor prompt changes (e.g., name swaps, irrelevant text).
  • There is interest in more systematic, adversarial benchmarks that probe robustness to superficial variations and clarify when CoT helps vs harms.
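
The kind of robustness probe commenters call for can be sketched as a simple variant generator: take a benchmark question and produce superficially different versions (name swaps, appended irrelevant text) that should not change the answer. This is an illustrative sketch, not code from the paper; the function name `perturb` and the specific perturbations are assumptions:

```python
def perturb(question,
            names=("Alice", "Bob"),
            replacements=("Priya", "Chen"),
            distractor="Note: the weather was clear that day."):
    """Generate superficially varied versions of a benchmark question.

    A robust model should answer every variant identically, since none
    of the changes affects the underlying problem.
    """
    variants = [question]

    # Name swap: substitute each named entity with a different name.
    swapped = question
    for old, new in zip(names, replacements):
        swapped = swapped.replace(old, new)
    variants.append(swapped)

    # Irrelevant text: append a distractor sentence.
    variants.append(question + " " + distractor)

    return variants


q = "Alice gives Bob 3 apples and keeps 2. How many apples does Bob have?"
for v in perturb(q):
    print(v)
```

An eval harness would then run the model (with and without a CoT prompt) on all variants and report how often the answers agree, separating genuine capability from prompt-surface sensitivity.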