Understanding the Limitations of Mathematical Reasoning in LLMs

Nature of “reasoning” in LLMs

  • Many commenters see current models as doing “associative” or pattern-based reasoning: they match inputs against seen templates and local patterns rather than applying a robust logical calculus.
  • Others argue that with enough training, models learn compressed internal procedures that function as reasoning, even if implemented via statistics over tokens.
  • There is debate over definitions: some insist “reasoning” should mean formal, verifiable deduction; others say if next-token prediction reliably yields correct chains of thought, the label is mostly philosophical.

Benchmarks, perturbations, and overfitting

  • The paper under discussion and related work show large drops in accuracy when:
    • Irrelevant details are added to word problems.
    • Numbers or surface phrasing are changed while the underlying structure stays the same.
  • Some see this as evidence of overfitting and benchmark gaming (Goodhart’s Law, possible training contamination, RLHF targeted at famous puzzles).
  • Others note newer models (e.g., reasoning-optimized ones) degrade less, and claim the paper overemphasizes small or older models to argue for fundamental limits.
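The perturbation methodology above can be sketched as a templated word problem: surface values and names are resampled, an irrelevant clause is optionally appended, and the ground-truth answer is recomputed from the template, so any accuracy drop is attributable purely to surface changes. The template, names, and distractor here are illustrative inventions, not examples from the paper.

```python
import random

# Illustrative GSM-style template (not from the paper): surface values vary,
# but the underlying arithmetic structure -- a * b + c -- is fixed.
TEMPLATE = ("{name} picks {a} baskets of apples with {b} apples each, "
            "then finds {c} more apples on the ground. "
            "{distractor}How many apples does {name} have?")

# An irrelevant clause: true-sounding, but it never affects the answer.
DISTRACTOR = "Five of the apples are slightly smaller than average. "

def make_variant(seed: int, with_distractor: bool = False):
    rng = random.Random(seed)
    a, b, c = rng.randint(2, 9), rng.randint(3, 12), rng.randint(1, 20)
    text = TEMPLATE.format(
        name=rng.choice(["Alice", "Ravi", "Mei"]),
        a=a, b=b, c=c,
        distractor=DISTRACTOR if with_distractor else "",
    )
    return text, a * b + c  # question plus recomputed ground-truth answer

q, gold = make_variant(seed=0)
q2, gold2 = make_variant(seed=0, with_distractor=True)
assert gold == gold2  # the distractor never changes the correct answer
```

Because the gold answer is derived from the same sampled values, a model scored on many such variants is being tested on structure, not on memorized surface forms.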

Concrete successes and failures

  • Examples where models succeed: classic “Alice’s siblings” puzzle, some nontrivial algebra, and advanced functional analysis guidance that a math PhD student finds genuinely helpful.
  • Examples where they fail:
    • Slightly modified river-crossing puzzles and family riddles.
    • Misinterpreting a doctor riddle when pronouns change.
    • Simple arithmetic/algebra (e.g., decimal comparisons, binomial expansion) in some settings.
    • Struggling with irrelevant clauses in GSM-like problems and arbitrary-length exact computation.
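The decimal-comparison and binomial failures above are trivially checkable in code, which is part of why commenters find them striking. A minimal sketch of both checks, using only the standard library (the specific values are illustrative):

```python
from decimal import Decimal
from math import comb

# The classic trap: version-number intuition says "9.11" > "9.9",
# but as decimal magnitudes 9.9 > 9.11. Exact decimal comparison
# avoids both that trap and binary-float quirks.
def compare_decimals(x: str, y: str) -> str:
    a, b = Decimal(x), Decimal(y)
    if a == b:
        return "equal"
    return f"{x} > {y}" if a > b else f"{x} < {y}"

print(compare_decimals("9.11", "9.9"))  # -> 9.11 < 9.9

# Binomial expansion, another small task models sometimes fumble:
# coefficients of (x + y)**n come straight from math.comb.
def binomial_coeffs(n: int) -> list[int]:
    return [comb(n, k) for k in range(n + 1)]

assert binomial_coeffs(2) == [1, 2, 1]  # (x+y)^2 = x^2 + 2xy + y^2
```

The contrast is the point: a few lines of exact computation are deterministic, while a next-token predictor can be steered off course by surface features of the prompt.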

Comparison with human reasoning

  • Several commenters note strong similarities to average students: performance collapses when problems introduce multiple steps, distracting details, or unfamiliar phrasing.
  • Others stress key differences:
    • Humans can notice distraction, withhold answers, and update their understanding across a conversation.
    • LLMs remain confidently wrong and often revert to earlier misinterpretations.
  • One prominent mathematician is quoted as likening a top model to a “mediocre but not incompetent” grad student that still needs heavy hinting.

Architecture, data, and tools

  • Some blame tokenization of numbers, finite depth/attention, and fixed compute per token for reasoning brittleness; others say these can be mitigated with better schemes or external tools.
  • Synthetic math data is widely discussed: easy to generate for formal math, harder for genuine quantitative reasoning; current gains on benchmarks are incremental, not saturating.
  • Many argue real reliability will come from LLMs embedded in larger systems (tools, solvers, formal languages), not from raw next-token prediction alone.
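The “embed the LLM in a larger system” pattern can be sketched as a tool-call router: the model emits a structured call, and a host program does the arithmetic exactly instead of predicting digits token by token. The JSON call format, tool name, and router below are invented for illustration; they do not correspond to any specific API.

```python
from fractions import Fraction
import json
import operator

# Hypothetical tool-call format: the model emits
# {"tool": "calc", "op": "...", "args": [...]} and the host computes
# the result with exact rational arithmetic rather than token prediction.
OPS = {"add": operator.add, "sub": operator.sub,
       "mul": operator.mul, "div": operator.truediv}

def run_tool_call(call_json: str) -> str:
    call = json.loads(call_json)
    assert call["tool"] == "calc", "unknown tool"
    args = [Fraction(a) for a in call["args"]]  # exact, no float rounding
    result = args[0]
    for a in args[1:]:
        result = OPS[call["op"]](result, a)
    return str(result)

# e.g. a model-emitted call for 1/3 + 1/6, answered exactly as 1/2
# instead of as a truncated decimal string.
print(run_tool_call('{"tool": "calc", "op": "add", "args": ["1/3", "1/6"]}'))
```

The design choice this illustrates: the LLM is responsible only for recognizing that a calculation is needed and formulating it, while correctness of the arithmetic is guaranteed by the solver, which is the reliability argument the commenters are making.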

Implications and outlook

  • Optimists: scaling, improved data, and “slow thinking” variants (explicit chains-of-thought, tool use) will push LLM reasoning past most humans in many domains.
  • Skeptics: core limitations (fragility to irrelevant info, lack of verifiable guarantees, error rates on simple tasks) make them untrustworthy for correctness-critical math and for serious agentic use.
  • Several tie this to the broader AI investment bubble and warn that headline math/olympiad claims may overstate true, general reasoning ability.