Understanding the Limitations of Mathematical Reasoning in LLMs

Nature of “reasoning” in LLMs

  • Many commenters see current models as doing “associative” or pattern-based reasoning: they match inputs against seen templates and local patterns rather than applying a robust logical calculus.
  • Others argue that with enough training, models learn compressed internal procedures that function as reasoning, even if implemented via statistics over tokens.
  • There is debate over definitions: some insist “reasoning” should mean formal, verifiable deduction; others say if next-token prediction reliably yields correct chains of thought, the label is mostly philosophical.

Benchmarks, perturbations, and overfitting

  • The paper under discussion and related work show large drops in accuracy when:
    • Irrelevant details are added to word problems.
    • Numbers or surface phrasing are changed while the underlying structure stays the same.
  • Some see this as evidence of overfitting and benchmark gaming (Goodhart’s Law, possible training contamination, RLHF targeted at famous puzzles).
  • Others note newer models (e.g., reasoning-optimized ones) degrade less, and claim the paper overemphasizes small or older models to argue for fundamental limits.
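The perturbation methodology above can be sketched as a templated word problem: surface values and names are resampled, an irrelevant clause is optionally appended, and the ground-truth answer is recomputed from the template, so any accuracy drop is attributable purely to surface changes. The template, names, and distractor here are illustrative inventions, not examples from the paper.

```python
import random

# Illustrative GSM-style template (not from the paper): surface values vary,
# but the underlying arithmetic structure -- a * b + c -- is fixed.
TEMPLATE = ("{name} picks {a} baskets of apples with {b} apples each, "
            "then finds {c} more apples on the ground. "
            "{distractor}How many apples does {name} have?")

# An irrelevant clause: true-sounding, but it never affects the answer.
DISTRACTOR = "Five of the apples are slightly smaller than average. "

def make_variant(seed: int, with_distractor: bool = False):
    rng = random.Random(seed)
    a, b, c = rng.randint(2, 9), rng.randint(3, 12), rng.randint(1, 20)
    text = TEMPLATE.format(
        name=rng.choice(["Alice", "Ravi", "Mei"]),
        a=a, b=b, c=c,
        distractor=DISTRACTOR if with_distractor else "",
    )
    return text, a * b + c  # question plus recomputed ground-truth answer

q, gold = make_variant(seed=0)
q2, gold2 = make_variant(seed=0, with_distractor=True)
assert gold == gold2  # the distractor never changes the correct answer
```

Because the gold answer is derived from the same sampled values, a model scored on many such variants is being tested on structure, not on memorized surface forms.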

Concrete successes and failures

  • Examples where models succeed: classic “Alice’s siblings” puzzle, some nontrivial algebra, and advanced functional analysis guidance that a math PhD student finds genuinely helpful.
  • Examples where they fail:
    • Slightly modified river-crossing puzzles and family riddles.
    • Misinterpreting a doctor riddle when pronouns change.
    • Simple arithmetic/algebra (e.g., decimal comparisons, binomial expansion) in some settings.
    • Struggling with irrelevant clauses in GSM-like problems and arbitrary-length exact computation.
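The decimal-comparison and binomial failures above are trivially checkable in code, which is part of why commenters find them striking. A minimal sketch of both checks, using only the standard library (the specific values are illustrative):

```python
from decimal import Decimal
from math import comb

# The classic trap: version-number intuition says "9.11" > "9.9",
# but as decimal magnitudes 9.9 > 9.11. Exact decimal comparison
# avoids both that trap and binary-float quirks.
def compare_decimals(x: str, y: str) -> str:
    a, b = Decimal(x), Decimal(y)
    if a == b:
        return "equal"
    return f"{x} > {y}" if a > b else f"{x} < {y}"

print(compare_decimals("9.11", "9.9"))  # -> 9.11 < 9.9

# Binomial expansion, another small task models sometimes fumble:
# coefficients of (x + y)**n come straight from math.comb.
def binomial_coeffs(n: int) -> list[int]:
    return [comb(n, k) for k in range(n + 1)]

assert binomial_coeffs(2) == [1, 2, 1]  # (x+y)^2 = x^2 + 2xy + y^2
```

The contrast is the point: a few lines of exact computation are deterministic, while a next-token predictor can be steered off course by surface features of the prompt.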

Comparison with human reasoning

  • Several commenters note strong similarities to average students: performance collapses when problems introduce multiple steps, distracting details, or unfamiliar phrasing.
  • Others stress key differences:
    • Humans can notice distraction, withhold answers, and update their understanding across a conversation.
    • LLMs remain confidently wrong and often revert to earlier misinterpretations.
  • One prominent mathematician is quoted as likening a top model to a “mediocre but not incompetent” grad student that still needs heavy hinting.

Architecture, data, and tools

  • Some blame tokenization of numbers, finite depth/attention, and fixed compute per token for reasoning brittleness; others say these can be mitigated with better schemes or external tools.
  • Synthetic math data is widely discussed: easy to generate for formal math, harder for genuine quantitative reasoning; current gains on benchmarks are incremental, not saturating.
  • Many argue real reliability will come from LLMs embedded in larger systems (tools, solvers, formal languages), not from raw next-token prediction alone.
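The “embed the LLM in a larger system” pattern can be sketched as a tool-call router: the model emits a structured call, and a host program does the arithmetic exactly instead of predicting digits token by token. The JSON call format, tool name, and router below are invented for illustration; they do not correspond to any specific API.

```python
from fractions import Fraction
import json
import operator

# Hypothetical tool-call format: the model emits
# {"tool": "calc", "op": "...", "args": [...]} and the host computes
# the result with exact rational arithmetic rather than token prediction.
OPS = {"add": operator.add, "sub": operator.sub,
       "mul": operator.mul, "div": operator.truediv}

def run_tool_call(call_json: str) -> str:
    call = json.loads(call_json)
    assert call["tool"] == "calc", "unknown tool"
    args = [Fraction(a) for a in call["args"]]  # exact, no float rounding
    result = args[0]
    for a in args[1:]:
        result = OPS[call["op"]](result, a)
    return str(result)

# e.g. a model-emitted call for 1/3 + 1/6, answered exactly as 1/2
# instead of as a truncated decimal string.
print(run_tool_call('{"tool": "calc", "op": "add", "args": ["1/3", "1/6"]}'))
```

The design choice this illustrates: the LLM is responsible only for recognizing that a calculation is needed and formulating it, while correctness of the arithmetic is guaranteed by the solver, which is the reliability argument the commenters are making.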

Implications and outlook

  • Optimists: scaling, improved data, and “slow thinking” variants (explicit chains-of-thought, tool use) will push LLM reasoning past most humans in many domains.
  • Skeptics: core limitations (fragility to irrelevant info, lack of verifiable guarantees, error rates on simple tasks) make them untrustworthy for correctness-critical math and for serious agentic use.
  • Several tie this to the broader AI investment bubble and warn that headline math/olympiad claims may overstate true, general reasoning ability.