Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens

Study design, toy models, and extrapolation

  • Major critique: the paper uses a tiny GPT-2–style model (4 layers, 32-dim embeddings), while media stories implicitly generalize its findings to frontier LLMs, a leap some commenters find “useless” or misleading (a scale sketch follows this list).
  • Others argue small-model studies are valid as long as the paradigm is the same, and that scale mostly changes performance, not the underlying mechanism.
  • There’s disagreement over whether the results reliably extrapolate: some say “model size is a trivial parameter,” while others point to depth–sequence-length results suggesting shallow transformers fundamentally cannot do certain tasks.
  • Debate on “emergence”: some see qualitative shifts at scale; others say this is just better interpolation, not a new capability.
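
  To put the scale objection in numbers, here is a back-of-the-envelope parameter count for a transformer of roughly the size the thread describes (4 layers, 32-dim). The vocabulary size and context length below are illustrative assumptions, not figures from the paper.

  ```python
  from dataclasses import dataclass

  @dataclass
  class ToyConfig:
      n_layer: int = 4          # depth reported in the thread
      d_model: int = 32         # embedding width reported in the thread
      n_head: int = 4           # heads partition d_model, so they don't affect the count
      vocab_size: int = 1_000   # assumption: small synthetic-alphabet vocabulary
      n_ctx: int = 128          # assumption: short context length

  def param_count(cfg: ToyConfig) -> int:
      d = cfg.d_model
      per_block = (
          (4 * d * d + 4 * d)    # attention: fused QKV + output projection, with biases
          + (8 * d * d + 5 * d)  # MLP: d -> 4d -> d projections, with biases
          + 4 * d                # two LayerNorms (scale + bias each)
      )
      embeddings = cfg.vocab_size * d + cfg.n_ctx * d
      return cfg.n_layer * per_block + embeddings + 2 * d  # + final LayerNorm

  print(f"~{param_count(ToyConfig()):,} parameters")  # on the order of 10^5
  ```

  At roughly 10^5 parameters, such a model sits many orders of magnitude below frontier LLMs, which is the crux of the extrapolation disagreement above.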

Synthetic data, cyclic training, and “collapse”

  • One subthread clarifies that training once on another LLM’s outputs (synthetic augmentation) and cyclic self-training on one’s own generations are different phenomena (see the sketch after this list).
  • Prior “model collapse” coverage is criticized as sensationalist and based on toy setups. RL-style methods (RLAIF/GRPO) are cited as cases where training on one’s own data is safe when grounded in external truth signals.
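
  A toy illustration (not from the paper or the thread, just a common stand-in) of why the two regimes differ: here the “model” is a fitted Gaussian. A single augmentation pass keeps a fixed teacher in the loop, while cyclic self-training refits each generation on the previous generation’s samples, letting estimation error compound.

  ```python
  import random
  import statistics

  random.seed(0)
  real_data = [random.gauss(0.0, 1.0) for _ in range(1000)]

  def fit(samples):
      """'Train' a model: estimate mean and standard deviation."""
      return statistics.mean(samples), statistics.stdev(samples)

  def sample(model, n):
      """'Generate' from the model."""
      mu, sigma = model
      return [random.gauss(mu, sigma) for _ in range(n)]

  # One pass: train once on real data plus outputs from a fixed teacher.
  teacher = fit(real_data)
  one_pass = fit(real_data + sample(teacher, 1000))

  # Cyclic: each generation is fit only to the previous generation's samples.
  model = fit(real_data)
  for _ in range(100):
      model = fit(sample(model, 100))

  print("one-pass augmentation:", one_pass)        # stays close to (0, 1)
  print("after 100 self-training cycles:", model)  # drifts; sigma tends to shrink
  ```

  The grounding the comments point to is the filter between generation and reuse: RL-style setups score or reject generations against an external signal before training on them, which breaks the compounding drift.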

Reasoning vs pattern simulation

  • Many accept that CoT often produces fluent, plausible “reasoning-like” text whose intermediate steps don’t reliably match its conclusions or reality.
  • One camp says that’s exactly what “sophisticated simulators of reasoning-like text” means; another says this is just how the probabilistic search process works, and that calling it “reasoning” is as acceptable as saying a chess engine “values material.”
  • Some insist LLMs “just predict text” with no concepts or understanding; others recount strong experiences (complex math algorithms, custom scheduling, novel research domains) as evidence of nontrivial reasoning-like generalization.

Out-of-domain tests and known weaknesses

  • The letter-rotation / symbol-permutation tasks are noted as a known weak spot for token-based models (a sketch of the task follows this list).
  • Supporters say that’s the point: the model can verbally explain the task yet still fail to apply it, suggesting the internal “chain of thought” isn’t executing a genuine algorithm.
  • Counterarguments liken this to human dyslexia or perceptual limits: failure on a particular substrate doesn’t prove absence of reasoning.
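
  For readers unfamiliar with the task family, here is a minimal sketch of a letter-rotation (ROT-n) transform; the paper’s exact task format may differ. The rule is trivial to state in natural language, but applying it is a character-level operation that subword-tokenized models don’t get for free.

  ```python
  import string

  def rotate(text: str, n: int) -> str:
      """Shift each ASCII letter forward by n positions, wrapping around."""
      lower = string.ascii_lowercase
      upper = string.ascii_uppercase
      table = str.maketrans(
          lower + upper,
          lower[n % 26:] + lower[:n % 26] + upper[n % 26:] + upper[:n % 26],
      )
      return text.translate(table)

  assert rotate("APPLE", 2) == "CRRNG"
  print(rotate("reasoning", 13))  # -> "ernfbavat"
  ```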

Hype, marketing, and public understanding

  • Several comments stress the paper’s value as a corrective to marketing that equates LLMs with robust human-like reasoning and promises white‑collar automation.
  • Media and platform incentives are blamed for overhyping “reasoning” and “catastrophic collapse” narratives alike.
  • There’s talk of a coming “trough of disillusionment,” but also of substantial real productivity gains in coding, and frustration with LLM-generated noise in support and communication workflows.