Chain-of-thought can hurt performance on tasks where thinking makes humans worse

Where Chain-of-Thought (CoT) Helps vs Hurts

  • CoT is widely reported, and confirmed in the paper, to improve performance on many tasks, especially complex reasoning and code generation.
  • The new result: on some tasks—implicit statistical learning, visual recognition, and pattern classification with exceptions—forcing step-by-step reasoning can significantly reduce accuracy.
  • Some commenters liken this to “don’t overthink it”: for tasks optimized for fast pattern recognition, serial verbal reasoning can interfere with strong implicit representations.
  • Others note that CoT also adds major inference cost, undermining the “train once, cheap inference forever” promise.

Human Cognition Parallels (Overthinking, Muscle Memory)

  • Many draw analogies to humans:
    • Motor skills (catching a ball, shooting pool, athletic technique) get worse when you consciously micromanage movements instead of relying on muscle memory / implicit learning.
    • Flow states vs self-conscious analysis in athletics and creativity.
    • Grammar judgments and password recall degrade when you try to verbalize each step.
  • These are framed as evidence that explicit reasoning can disrupt optimized implicit processes in both brains and models.

Do LLMs ‘Reason’ or Just Predict Tokens?

  • One camp: LLMs are “just” next-token predictors / compressed internet; CoT cannot create information that isn’t there, only rephrases patterns.
  • Opposing camp: next-token prediction doesn’t preclude reasoning; humans may also be sophisticated predictors. Good performance on math, code, and logic-like tasks is cited as evidence of emergent reasoning.
  • Disagreement over whether mathematical or scientific breakthroughs are necessary to count as “real reasoning,” and whether current models are fundamentally incapable of such leaps.

World Models, Semantics, and Plato’s Cave

  • Some argue LLMs lack true world models, ontology, and grounded semantics; they manipulate symbols without experiential contact with reality.
  • Others cite research suggesting internal “world-like” representations (e.g., in games, demographics, physics-like structure) emerge because they improve prediction.
  • A recurring metaphor: LLMs operate on “word models,” akin to prisoners in Plato’s Cave inferring the world from shadows (text), not direct experience.

AGI Prospects and Local Maxima

  • Several commenters see LLMs as a local maximum, not a path to AGI: no persistent memory, no embodied world modeling, heavy dependence on static training.
  • Others think LLMs (or LLM-like components) will remain central building blocks of more general systems, especially when combined with tools, memory, and multimodal input.
  • The thread reflects strong skepticism about AGI timelines, but also recognition of LLMs’ surprising versatility and economic value.

Benchmarks, Robustness, and Reproducibility

  • Some stress that CoT often empirically improves code and math evals; the paper’s negative results are framed as task-specific, not universal.
  • Others criticize LLM research for small or opaque datasets, lack of released code/data, and sensitivity to minor prompt changes (e.g., name swaps, irrelevant text).
  • There is interest in more systematic, adversarial benchmarks that probe robustness to superficial variations and clarify when CoT helps vs harms.
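
The kind of robustness probe commenters call for can be sketched as a simple variant generator: take a benchmark question and produce superficially different versions (name swaps, appended irrelevant text) that should not change the answer. This is an illustrative sketch, not code from the paper; the function name `perturb` and the specific perturbations are assumptions:

```python
def perturb(question,
            names=("Alice", "Bob"),
            replacements=("Priya", "Chen"),
            distractor="Note: the weather was clear that day."):
    """Generate superficially varied versions of a benchmark question.

    A robust model should answer every variant identically, since none
    of the changes affects the underlying problem.
    """
    variants = [question]

    # Name swap: substitute each named entity with a different name.
    swapped = question
    for old, new in zip(names, replacements):
        swapped = swapped.replace(old, new)
    variants.append(swapped)

    # Irrelevant text: append a distractor sentence.
    variants.append(question + " " + distractor)

    return variants


q = "Alice gives Bob 3 apples and keeps 2. How many apples does Bob have?"
for v in perturb(q):
    print(v)
```

An eval harness would then run the model (with and without a CoT prompt) on all variants and report how often the answers agree, separating genuine capability from prompt-surface sensitivity.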