Recent results show that LLMs struggle with compositional tasks
Reliability, Arithmetic, and Expectations
- Commenters debate whether 98% accuracy on adding 100‑digit numbers is “impressive” or “atrocious.”
- Critics compare it to ordinary computers (which are effectively perfect); defenders compare it to “generalist” systems (humans or LLMs that can also chat, explain history, etc.).
- Some emphasize throughput and fatigue: a human sustaining 98% accuracy across hundreds of 100‑digit additions is itself nontrivial.
- LLM nondeterminism and hallucinations clash with people’s mental model of computers as deterministic and correct, making behavior feel disturbingly “human-like.”
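Part of the "impressive vs. atrocious" disagreement is just arithmetic: per-item reliability compounds across a batch. A minimal sketch of that point (assuming independent errors, which is a simplification):

```python
# Probability that every addition in a batch is correct, assuming each
# 100-digit addition independently succeeds with probability p.
def batch_success(p: float, n: int) -> float:
    return p ** n

# A 98%-accurate "adder" run over 300 additions almost certainly slips somewhere:
print(f"{batch_success(0.98, 300):.4f}")  # prints 0.0023

# An ordinary computer's integer addition, by contrast, is exact at any width:
a = 10**99 + 7
assert (a + 1) + (a - 1) == 2 * a
```

This is why "98% accurate" reads as perfect to one camp (per-item, vs. a tired human) and as broken to the other (batch-level, vs. a CPU).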
Are the Limits “Fundamental”?
- The article’s cited work mostly investigates decoder‑only transformers in single forward passes; several argue this is a narrow setting, not “all transformer‑based LLMs,” let alone agentic systems.
- Others point out that any finite‑depth network has strict computational limits (pigeonhole principle, circuit depth arguments), but chain‑of‑thought and long token streams increase effective computation.
- Gödel/Turing are invoked: some argue they cap what finite LLMs can do; others counter these theorems constrain humans equally and are practically irrelevant compared to complexity/intractability.
Chain-of-Thought, Tools, and Logic Puzzles
- Multiple experiments are shared with Einstein’s “zebra” puzzle and a 5th‑grade algebra grid:
  - Newer reasoning models (o1/o3-mini, DeepSeek-R1) often solve them with long chain‑of‑thought; older or smaller models fail or cheat by recalling the canonical answer.
  - Concerns about data contamination lead to prompt modifications (renaming entities, permuting clues), with mixed success.
- Some use LLMs to generate Prolog or Z3 code that then solves such puzzles exactly; debate ensues whether “translation + external solver” counts as the LLM solving the task.
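The "translation + external solver" pattern is easy to sketch. Below, a brute-force permutation search stands in for the exact backend (commenters actually used Prolog or Z3) on a miniature zebra-style puzzle; the three clues and nationalities here are invented for illustration:

```python
from itertools import permutations

# Miniature zebra-style puzzle, 3 houses in a row (positions 0..2):
#   1. The Spaniard lives in the red house.
#   2. The green house is immediately to the right of the red house.
#   3. The Norwegian lives in the first house.
# Question: who lives in the green house?
people = ("Norwegian", "Spaniard", "Ukrainian")
colors = ("red", "green", "blue")

def solve():
    for who in permutations(people):        # who[i] lives at position i
        for col in permutations(colors):    # col[i] is the color of house i
            if (who.index("Spaniard") == col.index("red")        # clue 1
                and col.index("green") == col.index("red") + 1   # clue 2
                and who[0] == "Norwegian"):                      # clue 3
                return who[col.index("green")]

print(solve())  # prints Ukrainian
```

The LLM's job in this pattern is only the translation from prose clues to constraints; the search itself is exhaustive and exact, which is the crux of the "does this count as the LLM solving it?" debate.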
Pattern Matching vs Reasoning and Human Differences
- Many frame transformers as powerful pattern matchers over internet text, not true reasoners; chain‑of‑thought is seen as forcing them to search their internal space more effectively.
- Discussions highlight gaps with humans: embodiment and continuous real‑world feedback, selective and specialized learning, symbolic reasoning over explicit rules, and something akin to a “limbic system” or dynamic reward structure.
- Training on noisy web data versus structured “textbook‑style” corpora is raised as a key limitation on deep, expert‑level reasoning.
Progress, Benchmarks, and Hype
- Some claim that purported “fundamental limitations” keep getting erased by new models (e.g., o3’s strong ARC‑AGI scores), while practical user experience hasn’t changed as dramatically.
- Others stress that the article’s main paper is from 2023 and evaluated GPT‑3/3.5/early‑4, now viewed as “ancient,” so its empirical claims shouldn’t be read as the current frontier.
- There is broad agreement that formalizing LLM failure modes on compositional tasks is valuable, but disagreement on how much these results constrain next‑generation, tool‑augmented systems.