Chain-of-thought can hurt performance on tasks where thinking makes humans worse
Where Chain-of-Thought (CoT) Helps vs Hurts
- CoT is widely reported, including in the paper under discussion, to improve performance on many tasks, especially complex reasoning and code generation.
- The new result: on some tasks (implicit statistical learning, visual recognition, and pattern classification with exceptions), forcing step-by-step reasoning can significantly reduce accuracy; see the evaluation sketch after this list.
- Some commenters liken this to “don’t overthink it”: for tasks optimized for fast pattern recognition, serial verbal reasoning can interfere with strong implicit representations.
- Others note that CoT also adds major inference cost, undermining the “train once, cheap inference forever” promise.
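A minimal sketch of the kind of comparison the paper runs, not its actual setup: evaluate the same model on a toy "classification with exceptions" task once with a direct-answer prompt and once with a step-by-step prompt, and compare accuracy. The `call_llm` stub, the prompt templates, and the toy rule are all assumptions for illustration; swap in a real model client to use it.

```python
# Hypothetical sketch: direct-answer vs chain-of-thought prompting on a toy
# "pattern classification with exceptions" task. call_llm() is a placeholder
# for whatever model API you actually use.

def call_llm(prompt: str) -> str:
    """Hypothetical model call; replace with a real API client."""
    raise NotImplementedError

DIRECT_TEMPLATE = (
    "Classify the item as A or B. Answer with a single letter only.\n"
    "Item: {item}\nAnswer:"
)
COT_TEMPLATE = (
    "Classify the item as A or B. Think step by step, then give your final "
    "answer as a single letter on the last line.\nItem: {item}\n"
)

def evaluate(template: str, dataset: list[tuple[str, str]]) -> float:
    """Accuracy over (item, gold_label) pairs for one prompt style."""
    correct = 0
    for item, gold in dataset:
        reply = call_llm(template.format(item=item))
        # Take the last non-empty line so a CoT trace doesn't hide the answer.
        pred = [ln for ln in reply.strip().splitlines() if ln.strip()][-1].strip().upper()
        correct += int(pred.startswith(gold))
    return correct / len(dataset)

# Toy rule with an exception: even-length strings are "A", unless they end in "x".
dataset = [("abcd", "A"), ("abc", "B"), ("abcdqx", "B"), ("zz", "A")]
# print("direct:", evaluate(DIRECT_TEMPLATE, dataset))
# print("cot:   ", evaluate(COT_TEMPLATE, dataset))
```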
Human Cognition Parallels (Overthinking, Muscle Memory)
- Many draw analogies to humans:
  - Performance in sports, catching balls, pool, and other motor skills gets worse when you consciously micromanage movements instead of relying on muscle memory / implicit learning.
  - Flow states vs self-conscious analysis in athletics and creativity.
  - Grammar judgments and password recall that degrade when you try to verbalize each step.
- These are framed as evidence that explicit reasoning can disrupt optimized implicit processes in both brains and models.
Do LLMs ‘Reason’ or Just Predict Tokens?
- One camp: LLMs are “just” next-token predictors / a compressed internet; CoT cannot create information that isn’t already there, it only rephrases existing patterns.
- Opposing camp: next-token prediction doesn’t preclude reasoning; humans may also be sophisticated predictors. Good performance on math, code, and logic-like tasks is cited as evidence of emergent reasoning.
- Disagreement over whether mathematical or scientific breakthroughs are necessary to count as “real reasoning,” and whether current models are fundamentally incapable of such leaps.
World Models, Semantics, and Plato’s Cave
- Some argue LLMs lack true world models, ontology, and grounded semantics; they manipulate symbols without experiential contact with reality.
- Others cite research suggesting internal “world-like” representations (e.g., in games, demographics, physics-like structure) emerge because they improve prediction.
- A recurring metaphor: LLMs operate on “word models,” akin to prisoners in Plato’s Cave inferring the world from shadows (text), not direct experience.
AGI Prospects and Local Maxima
- Several commenters see LLMs as a local maximum, not a path to AGI: no persistent memory, no embodied world modeling, heavy dependence on static training data.
- Others think LLMs (or LLM-like components) will remain central building blocks of more general systems, especially when combined with tools, memory, and multimodal input.
- The thread reflects strong skepticism about AGI timelines, but also recognition of LLMs’ surprising versatility and economic value.
Benchmarks, Robustness, and Reproducibility
- Some stress that CoT often empirically improves code and math evals; the paper’s negative results are framed as task-specific, not universal.
- Others criticize LLM research for small or opaque datasets, lack of released code/data, and sensitivity to minor prompt changes (e.g., name swaps, irrelevant text).
- There is interest in more systematic, adversarial benchmarks that probe robustness to superficial variations and clarify when CoT helps versus harms; a small perturbation-probe sketch follows below.
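A minimal sketch of the robustness probe commenters describe, under assumed details: apply superficial, meaning-preserving edits (name swaps, irrelevant filler, punctuation noise) to a prompt and measure how often the model's answer survives. `answer_fn`, `NAME_SWAPS`, and `FILLER` are illustrative assumptions, not any established benchmark's API.

```python
# Hypothetical robustness probe: does the answer change under superficial,
# semantically irrelevant edits to the prompt?

import random
from typing import Callable

NAME_SWAPS = {"Alice": "Priya", "Bob": "Wei"}          # assumed name substitutions
FILLER = " Note: the weather was mild that day."        # irrelevant distractor text

def perturb(prompt: str, rng: random.Random) -> list[str]:
    """Generate superficially different but semantically equivalent prompts."""
    variants = []
    swapped = prompt
    for old, new in NAME_SWAPS.items():
        swapped = swapped.replace(old, new)
    variants.append(swapped)            # name swap
    variants.append(prompt + FILLER)    # append irrelevant text
    words = prompt.split()
    i = rng.randrange(len(words))
    variants.append(" ".join(words[:i] + [words[i] + ","] + words[i + 1:]))  # punctuation noise
    return variants

def consistency(answer_fn: Callable[[str], str], prompt: str, seed: int = 0) -> float:
    """Fraction of perturbed prompts whose answer matches the unperturbed answer."""
    rng = random.Random(seed)
    baseline = answer_fn(prompt).strip()
    variants = perturb(prompt, rng)
    same = sum(answer_fn(v).strip() == baseline for v in variants)
    return same / len(variants)

# Usage (with a hypothetical model wrapper):
# score = consistency(my_model_answer,
#                     "Alice gives Bob 3 apples and keeps 2. How many does Bob have?")
```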