LLMs don't do formal reasoning

Scope of the Critique

  • The thread centers on whether current LLMs can perform formal reasoning, or merely produce plausible-sounding text.
  • Many agree LLMs struggle with small wording changes, irrelevant details, and systematic generalization, especially in math/logic word problems.
  • Others argue that highlighting failures is overdone and often ignores LLMs' clear practical utility.

Usefulness vs. Reliability

  • Several commenters emphasize LLMs are already very useful tools for coding, research assistance, and everyday tasks, even if not perfectly reliable.
  • Critics counter that for agent-like systems or high-stakes domains, error rates and brittleness to small prompt changes are unacceptable.
  • Consensus: LLMs can be valuable, but they are not yet dependable foundations for fully autonomous “reasoning agents.”

Human vs. LLM Reasoning

  • One camp notes humans also reason poorly, fall for trick questions, and rely on heuristics.
  • Another camp stresses that even average humans can learn rule-following (e.g., legal chess moves, routine professional reasoning) that current LLMs often fail to match consistently.
  • There is disagreement over whether human and LLM reasoning are fundamentally different or just different points on a spectrum.

Architecture and Limits

  • Multiple comments highlight that transformers spend a fixed amount of computation per token; a harder problem does not get a deeper forward pass.
  • This encourages pattern-matching over stepwise reasoning and limits extrapolation and long-chain problem solving.
  • Some point to newer “reasoning” models (e.g., with hidden chains-of-thought or tool use) as partial progress but still failure-prone.
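The fixed-budget point can be made concrete with a toy sketch (a hypothetical stand-in, not a real transformer): the layer stack is a constant, so the work per token is identical whether the prompt is trivial or genuinely hard.

```python
# Toy illustration of fixed per-token compute in a transformer-like model.
# N_LAYERS is constant at inference time: every token passes through the
# same stack, so difficulty of the *problem* cannot buy extra depth.
N_LAYERS = 12

def forward(tokens):
    """Return a count of layer-passes as a stand-in for FLOPs."""
    ops = 0
    for _ in range(N_LAYERS):   # fixed depth, chosen at training time
        for _ in tokens:        # one pass per token per layer
            ops += 1
    return ops

# Same token count => identical compute, regardless of content:
print(forward(["2", "+", "2"]))                       # trivial prompt
print(forward(["riemann", "hypothesis", "proof"]))    # hard prompt
```

Chain-of-thought and "reasoning" models work around this by emitting more tokens, which buys more total compute, rather than by deepening any single forward pass.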

Benchmarks and the Kiwi Problem

  • A simple kiwi-counting word problem is discussed: some commenters report models failing it, while others get correct answers from current model versions.
  • This raises issues of stochastic outputs, prompt sensitivity, and possible benchmark contamination, but also shows incremental improvements.

Hybrid and Formal Methods Directions

  • Several propose using LLMs as front-ends: translating messy natural language into formal representations (SMT, ASP, theorem provers, calculators), then back to language.
  • Others describe or propose neuro-symbolic architectures and richer tokenization schemes to track logic, referents, and possible worlds.
  • There is broad interest in hybrid systems where symbolic methods provide rigor and LLMs handle language and glue logic.
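The front-end/back-end split proposed above can be sketched in a few lines. Here a brute-force checker stands in for a real SMT solver such as Z3 (the LLM's translation step is assumed, not shown); the point is only the division of labor, with the formal back-end supplying the rigor.

```python
# Sketch of the hybrid pipeline: an LLM would translate a word problem
# into formal constraints; a solver then answers exactly. The solver
# here is a brute-force stand-in for a real SMT back-end (e.g. Z3).
def solve(constraints, lo=0, hi=100):
    """Return the first (a, b) pair satisfying every constraint."""
    for a in range(lo, hi):
        for b in range(lo, hi):
            if all(c(a, b) for c in constraints):
                return a, b
    return None

# Hypothetical LLM output for: "Alice has 3 more apples than Bob;
# together they have 11 apples. How many does each have?"
constraints = [
    lambda a, b: a == b + 3,    # "Alice has 3 more than Bob"
    lambda a, b: a + b == 11,   # "together they have 11"
]

print(solve(constraints))  # (7, 4): Alice 7, Bob 4
```

A real system would emit SMT-LIB or Z3 API calls instead of Python lambdas, but the shape is the same: natural language in, formal constraints out, solver verdict back to language.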