Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
Study design, toy models, and extrapolation
- Major critique: the paper's experiments use a tiny GPT-2–style model (4 layers, 32-dimensional embeddings), while media stories implicitly generalize the findings to frontier LLMs, a combination some commenters find "useless" or misleading (see the config sketch after this list for a sense of the scale involved).
- Others argue small-model studies are valid as long as the paradigm is the same, and that scale mostly changes performance, not the underlying mechanism.
- There's disagreement over whether the results reliably extrapolate: some say "model size is a trivial parameter," while others point to depth–sequence-length results suggesting shallow transformers fundamentally cannot do some tasks.
- Debate on “emergence”: some see qualitative shifts at scale; others say this is just better interpolation, not a new capability.
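For a rough sense of the scale under debate, here is a minimal sketch, assuming Hugging Face's GPT2Config as a stand-in; only the depth and width come from the discussion, while n_head, n_positions, and vocab_size are illustrative guesses (the paper's setup presumably uses a small synthetic vocabulary rather than GPT-2's BPE vocabulary).

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Depth and width match the figures quoted above (4 layers, 32 dims);
# the remaining hyperparameters are assumptions, not the paper's values.
config = GPT2Config(
    n_layer=4,
    n_embd=32,
    n_head=4,          # assumed; 32-dim embeddings split into 4 heads of size 8
    n_positions=128,   # assumed context length
    vocab_size=50257,  # GPT-2's BPE vocab; a synthetic-task vocab would be far smaller
)
model = GPT2LMHeadModel(config)
print(f"total parameters:         {model.num_parameters():,}")
print(f"non-embedding parameters: {model.num_parameters(exclude_embeddings=True):,}")
```

With these assumptions, the four transformer blocks contribute only a few tens of thousands of parameters, and even the full GPT-2 vocabulary only pushes the total to roughly 1.7M, several orders of magnitude below frontier LLMs; that gap is what the extrapolation debate turns on.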
Synthetic data, cyclic training, and “collapse”
- One subthread distinguishes two different phenomena: training once on LLM outputs (synthetic augmentation) versus cyclic self-training on a model's own generations (see the toy sketch after this list).
- Prior "model collapse" coverage is criticized as sensationalist and based on toy setups. RL-style methods (RLAIF/GRPO) are cited as safe ways of "training on one's own data" when grounded in external truth signals.
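To make the one-shot vs. cyclic distinction concrete, here is a minimal stdlib toy (a standard illustration, not the paper's experiment or anything resembling real LLM training): a Gaussian repeatedly refit to its own samples stands in for cyclic self-training, and a single batch of teacher samples mixed back into real data stands in for one-shot synthetic augmentation.

```python
import random
import statistics

random.seed(0)

def fit(data):
    # "Training" = fitting a Gaussian (mean, stdev) to the data.
    return statistics.mean(data), statistics.stdev(data)

def generate(model, n):
    # "Generation" = sampling synthetic data from the fitted model.
    mu, sigma = model
    return [random.gauss(mu, sigma) for _ in range(n)]

real = [random.gauss(0.0, 1.0) for _ in range(20)]

# (a) One-shot synthetic augmentation: the real data stays in the mix,
#     alongside a single batch of teacher-generated samples.
teacher = fit(real)
print("one-shot augmentation:", fit(real + generate(teacher, 20)))

# (b) Cyclic self-training: each generation is fit only to samples drawn
#     from the previous generation, with no anchor back to real data.
model = fit(real)
for _ in range(50):
    model = fit(generate(model, 20))
print("after 50 cycles:", model)  # tends to drift away from (0, 1)
```

The RL-style loops mentioned above differ from case (b) in that a model's own generations are scored or filtered by an external signal (tests, verifiers, or other truth signals) before being trained on, which commenters argue is what keeps the loop anchored.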
Reasoning vs pattern simulation
- Many accept that CoT often produces fluent, plausible “reasoning-like” text whose steps don’t reliably match conclusions or reality.
- One camp says that’s exactly what “sophisticated simulators of reasoning-like text” means; another says this is just how the probabilistic search process works, and calling it “reasoning” is as acceptable as saying a chess engine “values material.”
- Some insist LLMs "just predict text" with no concepts or understanding; others recount strong firsthand experiences (complex math algorithms, custom scheduling, novel research domains) as evidence of nontrivial reasoning-like generalization.
Out-of-domain tests and known weaknesses
- The letter-rotation / symbol-permutation tasks are noted as a known weak spot for token-based models (an illustration of such a task follows this list).
- Supporters say that’s the point: the model can verbally explain the task yet still fail to apply it, suggesting the internal “chain of thought” isn’t a genuine algorithm.
- Counterarguments liken this to human dyslexia or perceptual limits: failure on a particular substrate doesn’t prove absence of reasoning.
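For concreteness, here is what a letter-rotation / symbol-permutation task looks like; the thread doesn't pin down the exact variant, so both a Caesar-style alphabet shift and a positional rotation are sketched, with hypothetical helper names.

```python
import string

def rotate_letters(word: str, shift: int) -> str:
    """Shift each letter forward in the alphabet, wrapping around (Caesar-style)."""
    upper = string.ascii_uppercase
    table = str.maketrans(upper, upper[shift:] + upper[:shift])
    return word.upper().translate(table)

def rotate_positions(word: str, k: int) -> str:
    """Rotate the characters' positions within the word (another permutation variant)."""
    k %= len(word)
    return word[k:] + word[:k]

print(rotate_letters("APPLE", 13))   # -> NCCYR
print(rotate_positions("APPLE", 2))  # -> PLEAP
```

Both transforms operate on individual characters, whereas a subword tokenizer usually hands the model multi-character tokens; that mismatch is a commonly cited reason these tasks are a weak spot even when a model can describe the procedure in words, which is the gap the supporters' argument above points to.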
Hype, marketing, and public understanding
- Several comments stress the paper’s value as a corrective to marketing that equates LLMs with robust human-like reasoning and promises white‑collar automation.
- Media and platform incentives are blamed for overhyping “reasoning” and “catastrophic collapse” narratives alike.
- There’s talk of a coming “trough of disillusionment,” but also of substantial real productivity gains in coding, and frustration with LLM-generated noise in support and communication workflows.