LLMs don't do formal reasoning

Scope of the Critique

  • The thread centers on whether current LLMs can perform formal reasoning, or merely produce plausible-sounding text.
  • Many agree LLMs struggle with small wording changes, irrelevant details, and systematic generalization, especially in math/logic word problems.
  • Others argue that highlighting failures is overdone and often ignores LLMs' clear practical utility.

Usefulness vs. Reliability

  • Several commenters emphasize LLMs are already very useful tools for coding, research assistance, and everyday tasks, even if not perfectly reliable.
  • Critics counter that for agent-like systems or high-stakes domains, error rates and brittleness to small prompt changes are unacceptable.
  • Consensus: LLMs can be valuable, but they are not yet dependable foundations for fully autonomous “reasoning agents.”

Human vs. LLM Reasoning

  • One camp notes humans also reason poorly, fall for trick questions, and rely on heuristics.
  • Another camp stresses that even average humans can learn rule-following (e.g., legal chess moves, routine professional reasoning) that current LLMs often fail to match consistently.
  • There is disagreement over whether human and LLM reasoning are fundamentally different or just different points on a spectrum.

Architecture and Limits

  • Multiple comments highlight that transformers spend a fixed amount of computation per token; a harder problem does not get a deeper forward pass.
  • This encourages pattern-matching over stepwise reasoning and limits extrapolation and long-chain problem solving.
  • Some point to newer “reasoning” models (e.g., with hidden chains-of-thought or tool use) as partial progress but still failure-prone.
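The fixed-budget point can be made concrete with a toy sketch (a hypothetical stand-in, not a real transformer): the layer stack is a constant, so the work per token is identical whether the prompt is trivial or genuinely hard.

```python
# Toy illustration of fixed per-token compute in a transformer-like model.
# N_LAYERS is constant at inference time: every token passes through the
# same stack, so difficulty of the *problem* cannot buy extra depth.
N_LAYERS = 12

def forward(tokens):
    """Return a count of layer-passes as a stand-in for FLOPs."""
    ops = 0
    for _ in range(N_LAYERS):   # fixed depth, chosen at training time
        for _ in tokens:        # one pass per token per layer
            ops += 1
    return ops

# Same token count => identical compute, regardless of content:
print(forward(["2", "+", "2"]))                       # trivial prompt
print(forward(["riemann", "hypothesis", "proof"]))    # hard prompt
```

Chain-of-thought and "reasoning" models work around this by emitting more tokens, which buys more total compute, rather than by deepening any single forward pass.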

Benchmarks and the Kiwi Problem

  • A simple kiwi-counting word problem is discussed: some commenters report models failing it, while others get correct answers from current model versions.
  • This raises issues of stochastic outputs, prompt sensitivity, and possible benchmark contamination, but also shows incremental improvements.

Hybrid and Formal Methods Directions

  • Several propose using LLMs as front-ends: translating messy natural language into formal representations (SMT, ASP, theorem provers, calculators), then back to language.
  • Others describe or propose neuro-symbolic architectures and richer tokenization schemes to track logic, referents, and possible worlds.
  • There is broad interest in hybrid systems where symbolic methods provide rigor and LLMs handle language and glue logic.
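The front-end/back-end split proposed above can be sketched in a few lines. Here a brute-force checker stands in for a real SMT solver such as Z3 (the LLM's translation step is assumed, not shown); the point is only the division of labor, with the formal back-end supplying the rigor.

```python
# Sketch of the hybrid pipeline: an LLM would translate a word problem
# into formal constraints; a solver then answers exactly. The solver
# here is a brute-force stand-in for a real SMT back-end (e.g. Z3).
def solve(constraints, lo=0, hi=100):
    """Return the first (a, b) pair satisfying every constraint."""
    for a in range(lo, hi):
        for b in range(lo, hi):
            if all(c(a, b) for c in constraints):
                return a, b
    return None

# Hypothetical LLM output for: "Alice has 3 more apples than Bob;
# together they have 11 apples. How many does each have?"
constraints = [
    lambda a, b: a == b + 3,    # "Alice has 3 more than Bob"
    lambda a, b: a + b == 11,   # "together they have 11"
]

print(solve(constraints))  # (7, 4): Alice 7, Bob 4
```

A real system would emit SMT-LIB or Z3 API calls instead of Python lambdas, but the shape is the same: natural language in, formal constraints out, solver verdict back to language.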