Recent results show that LLMs struggle with compositional tasks

Reliability, Arithmetic, and Expectations

  • Commenters debate whether 98% accuracy on adding 100‑digit numbers is “impressive” or “atrocious.”
  • Critics compare it to ordinary computers (which are effectively perfect); defenders compare it to “generalist” systems (humans or LLMs that can also chat, explain history, etc.).
  • Some emphasize throughput and fatigue: a human sustaining 98% accuracy across hundreds of 100‑digit additions is itself nontrivial.
  • LLM nondeterminism and hallucinations clash with people’s mental model of computers as deterministic and correct, making behavior feel disturbingly “human-like.”
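The reliability debate above can be made concrete with a back‑of‑envelope calculation, assuming the thread's 98% per‑addition figure and independent errors (both simplifications, not claims about any particular model):

```python
# Back-of-envelope sketch for the "98% accuracy" debate.
# Assumes the quoted 98% per-addition accuracy and independent errors.

def p_batch_correct(per_task_acc: float, n_tasks: int) -> float:
    """Probability that every addition in a batch of n_tasks is correct."""
    return per_task_acc ** n_tasks

# Ordinary big-integer arithmetic is exact -- Python ints are arbitrary precision:
a = 10**99 + 7   # a 100-digit number
b = 10**99 + 13  # another 100-digit number
assert a + b == 2 * 10**99 + 20  # always the exact sum, not 98% of the time

# At 98% per task, a run of 300 additions almost surely contains an error:
print(round(p_batch_correct(0.98, 300), 4))  # -> 0.0023
```

This is the asymmetry commenters are pointing at: per‑task accuracy that sounds high for a generalist compounds into near‑certain failure over a batch, where a conventional computer stays exact.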

Are the Limits “Fundamental”?

  • The article’s cited work mostly investigates decoder‑only transformers in single forward passes; several argue this is a narrow setting, not “all transformer‑based LLMs,” let alone agentic systems.
  • Others point out that any finite‑depth network has strict computational limits (pigeonhole principle, circuit depth arguments), but chain‑of‑thought and long token streams increase effective computation.
  • Gödel/Turing are invoked: some argue they cap what finite LLMs can do; others counter these theorems constrain humans equally and are practically irrelevant compared to complexity/intractability.
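The depth argument can be stated with a rough count of serial computation (a sketch under the standard decoder‑only setup, not a claim from the article): a transformer with $L$ layers performs $L$ sequential layer applications to emit one token, so a direct answer gets a fixed serial budget, while a chain of thought with $T$ intermediate tokens raises it linearly:

```latex
\text{serial depth (direct answer)} = L,
\qquad
\text{serial depth (CoT with $T$ extra tokens)} = L\,(T+1).
```

This is why "single forward pass" results need not bound systems that spend many tokens thinking, even though any fixed‑depth, fixed‑width pass remains limited.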

Chain-of-Thought, Tools, and Logic Puzzles

  • Multiple experiments are shared with Einstein’s “zebra” puzzle and a 5th‑grade algebra grid:
    • Newer reasoning models (o1/o3-mini, DeepSeek-R1) often solve them with long chain‑of‑thought; others (older or smaller models) fail or cheat by recalling the canonical answer.
    • Concerns about data contamination lead to prompt modifications (renaming entities, permuting clues), with mixed success.
  • Some use LLMs to generate Prolog or Z3 code that then solves such puzzles exactly; debate ensues whether “translation + external solver” counts as the LLM solving the task.
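The "translate the puzzle into code for an exact solver" strategy can be illustrated without external dependencies by brute‑forcing a miniature zebra‑style grid; the puzzle and clues below are invented for illustration, not taken from the thread:

```python
# Toy version of the "translation + external solver" approach: encode the
# clues as predicates and exhaustively search all assignments.
from itertools import permutations

COLORS = ("red", "green", "blue")
PETS = ("cat", "dog", "fish")

solutions = []
for colors in permutations(COLORS):      # colors[i] = color of house i (0-indexed)
    for pets in permutations(PETS):      # pets[i] = pet in house i
        if (
            colors.index("green") == colors.index("red") + 1       # green is just right of red
            and pets[colors.index("green")] == "fish"              # the fish is in the green house
            and pets[0] == "cat"                                   # the cat lives in house 1
            and abs(pets.index("dog") - colors.index("blue")) == 1 # dog owner is next to the blue house
        ):
            solutions.append((colors, pets))

print(solutions)  # -> [(('blue', 'red', 'green'), ('cat', 'dog', 'fish'))]
```

Generating this kind of program is exactly what the disputed Prolog/Z3 workflow does at larger scale: the LLM only has to formalize the clues correctly, and the search itself is then exact, which is the crux of the "does the LLM solve it?" disagreement.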

Pattern Matching vs Reasoning and Human Differences

  • Many frame transformers as powerful pattern matchers over internet text, not true reasoners; chain‑of‑thought is seen as forcing them to search their internal space more effectively.
  • Discussions highlight gaps with humans: embodiment and continuous real‑world feedback, selective and specialized learning, symbolic reasoning over explicit rules, and something akin to a “limbic system” or dynamic reward structure.
  • Training on noisy web data versus structured “textbook‑style” corpora is raised as a key limitation on deep, expert‑level reasoning.

Progress, Benchmarks, and Hype

  • Some claim that purported “fundamental limitations” keep getting erased by new models (e.g., o3’s strong ARC‑AGI scores), while practical user experience hasn’t changed as dramatically.
  • Others stress that the article’s main paper is from 2023 and evaluated GPT‑3/3.5/early‑4, now viewed as “ancient,” so its empirical claims shouldn’t be read as the current frontier.
  • There is broad agreement that formalizing LLM failure modes on compositional tasks is valuable, but disagreement on how much these results constrain next‑generation, tool‑augmented systems.