The wall confronting large language models

Paper accessibility and author expertise

  • Many commenters find the paper hard to read: heavy prose, dense equations, few concrete examples.
  • Debate over whether the authors are “outside their core field”: some see computational physics/chemistry as relevant to ML; others view lack of LLM-building experience as a credibility issue.
  • Meta‑discussion about gatekeeping: some argue ideas should stand on merit, others stress that bold claims from non‑practitioners deserve extra skepticism.

The “wall” and scaling of LLMs

  • Several readers think core LLM quality gains have slowed despite massive spend, suggesting we may be near the top of an S‑curve.
  • Others counter with business metrics such as revenue growth, or argue the paper is about capability scaling, not value for money.
  • Some expect future improvements more from agents, tools, and hybrid systems than from monolithic model scaling.

Markov chains, formal models, and expressivity

  • One thread explores an “extensional equivalence” between LLMs and high‑order Markov chains.
  • Critics say this equivalence is either trivial (any finite computation can be embedded in a sufficiently large Markov chain; a minimal sketch follows this list) or irrelevant to practical limits.
  • Disagreement over whether such reductions actually constrain what transformers can do, or just restate that high‑dimensional probabilistic dynamics are very expressive.
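
  A minimal sketch of the reduction being debated, assuming nothing beyond the Python standard library (the function names are mine, not the paper's): any predictor that chooses the next token from a bounded context of k previous tokens defines, extensionally, a k‑th order Markov chain over token sequences. The explicit table below is also the point of the "trivial" objection: at real vocabulary sizes and context lengths the state space is astronomically large, so the equivalence says little about practical limits.

      from collections import defaultdict, Counter
      import random

      def build_kth_order_chain(tokens, k):
          """Count next-token frequencies conditioned on the previous k tokens.

          Extensionally, any bounded-context next-token predictor defines such a
          conditional distribution; the difference is that this table is stored
          explicitly, which is infeasible at LLM scale.
          """
          table = defaultdict(Counter)
          for i in range(len(tokens) - k):
              state = tuple(tokens[i:i + k])
              table[state][tokens[i + k]] += 1
          return table

      def sample_next(table, state):
          """Sample the next token given the current k-token state."""
          counts = table.get(tuple(state))
          if not counts:
              return None
          toks, weights = zip(*counts.items())
          return random.choices(toks, weights=weights)[0]

      corpus = "the cat sat on the mat and the cat slept".split()
      chain = build_kth_order_chain(corpus, k=2)
      print(sample_next(chain, ["the", "cat"]))  # 'sat' or 'slept'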

Symbolic reasoning, backtracking, and Prolog

  • A long subthread argues that probabilistic sequence models fundamentally lack capabilities like logical backtracking and Prolog‑style search.
  • Others respond that backtracking can be simulated either inside the token stream or via an external loop or tool (sketched after this list); the bottleneck is practicality, not theoretical impossibility.
  • Sudoku and Prolog interpreters are used as test cases; debate centers on whether “LLM + scaffolding” counts as the model doing the reasoning.
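
  One way to make the "external loop" position concrete, under the assumption that a model call can be abstracted as a candidate proposer: a classic depth‑first scaffold asks the proposer for next moves, checks each extension with a deterministic verifier, and backtracks on failure. `propose_candidates` below is a hypothetical stand‑in for an LLM call (here a fixed list so the example stays self‑contained); whether the resulting search counts as "the model reasoning" is exactly what the subthread disputes.

      def propose_candidates(partial):
          """Placeholder for an LLM call that proposes next symbols.

          In the scaffolded setup debated in the thread this would be a model
          prompted with the partial solution; here it is a fixed candidate set
          so the example stays deterministic and self-contained.
          """
          return [1, 2, 3, 4]

      def violates(partial):
          """Deterministic verifier: toy 'mini-Sudoku row' constraint
          (all placed digits must be distinct)."""
          return len(set(partial)) != len(partial)

      def solve(partial, length=4):
          """Classic depth-first backtracking driven by the proposer.

          The loop, not the proposer, performs the backtracking: it undoes a
          choice whenever the verifier rejects the extended partial solution.
          """
          if violates(partial):
              return None
          if len(partial) == length:
              return partial
          for candidate in propose_candidates(partial):
              result = solve(partial + [candidate], length)
              if result is not None:
                  return result
          return None

      print(solve([]))  # [1, 2, 3, 4]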

Turing completeness and “reasoning”

  • Some argue that once an LLM is embedded in a simple loop it becomes Turing complete, so there is no principled barrier to any computable reasoning (a minimal loop is sketched after this list).
  • Opponents say this conflates mere computability with human‑like logical reasoning, invoking the Chinese Room and stressing that what matters is reliability and traceability, not bare possibility.
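
  The "simple loop" argument usually amounts to the wrapper below: keep feeding the model its own transcript until it emits a halt marker, so the system as a whole can iterate indefinitely. `step` is a hypothetical stand‑in for one model call (here a toy rule), which is also where the objection bites: computability of the wrapper says nothing about the reliability of the step function.

      def step(state: str) -> str:
          """Stand-in for one model call: reads the full state, returns a new state.

          A toy 'program': append '+' until ten have accumulated, then halt.
          In the argument summarized above, this would be an LLM conditioned
          on the transcript so far.
          """
          if state.count("+") >= 10:
              return state + " HALT"
          return state + "+"

      def run(initial: str, max_steps: int = 1000) -> str:
          """The outer loop: unbounded iteration over an unbounded transcript is
          what the Turing-completeness claim leans on; the cap here only keeps
          the demo finite."""
          state = initial
          for _ in range(max_steps):
              state = step(state)
              if state.endswith("HALT"):
                  break
          return state

      print(run("start:"))  # start:++++++++++ HALT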

Empirical limitations: math, logic, and hallucinations

  • Multiple anecdotes show state‑of‑the‑art models still failing at basic arithmetic or producing correct answers via incorrect intermediate steps.
  • This is taken by skeptics as evidence that “reasoning” is shallow pattern-matching; boosters reply that failures are mostly quantitative (error rates) and improvable.
  • Some note that as long as outputs must be checked by humans or tools, applicability stays constrained, much like perpetually supervised self‑driving cars; a minimal tool‑check is sketched after this list.
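
  A minimal sketch of the "check it with a tool" workflow, with `check_claim` and the deliberately wrong claimed value invented for illustration: the model's arithmetic answer is accepted only if an exact evaluator reproduces it.

      import ast
      import operator

      # Exact evaluator for simple arithmetic expressions (a minimal, safe 'tool').
      _OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
              ast.Mult: operator.mul, ast.Div: operator.truediv}

      def evaluate(expr: str):
          """Evaluate +, -, *, / on numeric literals exactly, without eval()."""
          def walk(node):
              if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
                  return _OPS[type(node.op)](walk(node.left), walk(node.right))
              if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
                  return node.value
              raise ValueError("unsupported expression")
          return walk(ast.parse(expr, mode="eval").body)

      def check_claim(expr: str, model_claim):
          """Compare a (hypothetical) model answer against the tool's exact result."""
          truth = evaluate(expr)
          return model_claim == truth, truth

      ok, truth = check_claim("1234 * 5678", 7006552)  # claimed value is wrong on purpose
      print(ok, truth)                                 # False 7006652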

Brain comparisons and energy use

  • The paper’s brain–LLM comparisons (synapses vs parameters, 20 W vs gigawatts) are criticized as superficial: no human could ingest an LLM‑scale training corpus, and inference energy per user is far lower than the headline training figure (a back‑of‑envelope sketch follows this list).
  • Others emphasize that, despite lower data and energy, humans still vastly outperform LLMs in flexible, grounded reasoning.
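
  A back‑of‑envelope version of the amortization argument. Every number below is an illustrative placeholder, not a figure from the paper or the thread; the only cited quantity is the ~20 W brain figure, used to convert the result into brain‑hours.

      # All numbers are illustrative placeholders; they only show how the
      # amortization argument is structured, not what the real values are.
      E_TRAIN_J   = 1e14    # assumed one-off training energy (joules)
      N_USERS     = 1e8     # assumed number of users served by the trained model
      QUERIES     = 1e3     # assumed queries per user over the model's lifetime
      E_QUERY_J   = 1e4     # assumed inference energy per query (joules)
      BRAIN_WATTS = 20      # the ~20 W brain figure cited in the comparison

      per_user_j = E_TRAIN_J / N_USERS + QUERIES * E_QUERY_J
      print(f"amortized energy per user: {per_user_j:.2e} J")
      print(f"equivalent hours of a 20 W brain: {per_user_j / (BRAIN_WATTS * 3600):.1f} h")

  The structure, not the placeholder numbers, carries the point: a one‑off training cost amortized over many users and queries is a different quantity from the gigawatt headline figure.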

Critique of specific technical analogies

  • The paper’s focus on floating‑point precision and discrete derivatives is questioned: commenters argue that high‑dimensional optimization behaves differently than the paper suggests and that it underappreciates how well SGD works in such spaces (see the sketch after this list).
  • Repeated references to nuclear reactors and numerical analysis strike some readers as forced or only loosely connected to real LLM training dynamics.
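
  To make the precision point concrete, the sketch below (assuming only NumPy) shows the standard trade‑off for a forward finite difference in float32: truncation error shrinks with the step size while rounding error grows, so accuracy bottoms out well short of what analytic gradients give. This is a generic numerical‑analysis illustration of the kind of discrete‑derivative effect being argued over, not a reconstruction of the paper's argument.

      import numpy as np

      def f(x):
          """A smooth test function; its exact derivative is cos(x)."""
          return np.sin(x)

      x = np.float32(1.0)
      exact = np.cos(np.float64(x))  # analytic derivative, computed in float64

      print("step h     | finite-difference error (float32)")
      for h in [1e-2, 1e-3, 1e-4, 1e-5, 1e-6]:
          h32 = np.float32(h)
          fd = (f(x + h32) - f(x)) / h32   # forward difference in float32
          print(f"{h:8.0e}   | {abs(float(fd) - exact):.2e}")
      # Truncation error shrinks with h, but float32 rounding error grows as h
      # shrinks, so the total error bottoms out and then climbs again.

  Training frameworks compute gradients analytically via backpropagation rather than by finite differences, which is part of why several commenters find the numerical‑analysis framing only loosely connected to real LLM training dynamics.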

Alternative directions and ML theory

  • Some participants see the paper as broadly right in spirit—LLMs will hit walls on deeper reasoning—and are exploring symbolic, Bayesian, or neuro‑symbolic systems as complements.
  • Others highlight a large but less visible body of ML theory and limits work; they worry hype around LLMs is crowding out more rigorous, long‑term lines of research.