Recent results show that LLMs struggle with compositional tasks

Reliability, Arithmetic, and Expectations

  • Commenters debate whether 98% accuracy on adding 100‑digit numbers is “impressive” or “atrocious.”
  • Critics compare it to ordinary computers (which are effectively perfect); defenders compare it to “generalist” systems (humans or LLMs that can also chat, explain history, etc.).
  • Some emphasize throughput and fatigue: a human sustaining 98% accuracy across hundreds of 100‑digit additions is itself nontrivial.
  • LLM nondeterminism and hallucinations clash with people’s mental model of computers as deterministic and correct, making behavior feel disturbingly “human-like.”
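The reliability debate above can be made concrete with a back‑of‑envelope calculation, assuming the thread's 98% per‑addition figure and independent errors (both simplifications, not claims about any particular model):

```python
# Back-of-envelope sketch for the "98% accuracy" debate.
# Assumes the quoted 98% per-addition accuracy and independent errors.

def p_batch_correct(per_task_acc: float, n_tasks: int) -> float:
    """Probability that every addition in a batch of n_tasks is correct."""
    return per_task_acc ** n_tasks

# Ordinary big-integer arithmetic is exact -- Python ints are arbitrary precision:
a = 10**99 + 7   # a 100-digit number
b = 10**99 + 13  # another 100-digit number
assert a + b == 2 * 10**99 + 20  # always the exact sum, not 98% of the time

# At 98% per task, a run of 300 additions almost surely contains an error:
print(round(p_batch_correct(0.98, 300), 4))  # -> 0.0023
```

This is the asymmetry commenters are pointing at: per‑task accuracy that sounds high for a generalist compounds into near‑certain failure over a batch, where a conventional computer stays exact.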

Are the Limits “Fundamental”?

  • The article’s cited work mostly investigates decoder‑only transformers in single forward passes; several argue this is a narrow setting, not “all transformer‑based LLMs,” let alone agentic systems.
  • Others point out that any finite‑depth network has strict computational limits (pigeonhole principle, circuit depth arguments), but chain‑of‑thought and long token streams increase effective computation.
  • Gödel/Turing are invoked: some argue they cap what finite LLMs can do; others counter these theorems constrain humans equally and are practically irrelevant compared to complexity/intractability.
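The depth argument can be stated with a rough count of serial computation (a sketch under the standard decoder‑only setup, not a claim from the article): a transformer with $L$ layers performs $L$ sequential layer applications to emit one token, so a direct answer gets a fixed serial budget, while a chain of thought with $T$ intermediate tokens raises it linearly:

```latex
\text{serial depth (direct answer)} = L,
\qquad
\text{serial depth (CoT with $T$ extra tokens)} = L\,(T+1).
```

This is why "single forward pass" results need not bound systems that spend many tokens thinking, even though any fixed‑depth, fixed‑width pass remains limited.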

Chain-of-Thought, Tools, and Logic Puzzles

  • Multiple experiments are shared with Einstein’s “zebra” puzzle and a 5th‑grade algebra grid:
    • Newer reasoning models (o1/o3-mini, DeepSeek-R1) often solve them with long chain‑of‑thought; others (older or smaller models) fail or cheat by recalling the canonical answer.
    • Concerns about data contamination lead to prompt modifications (renaming entities, permuting clues), with mixed success.
  • Some use LLMs to generate Prolog or Z3 code that then solves such puzzles exactly; debate ensues whether “translation + external solver” counts as the LLM solving the task.
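The "translate the puzzle into code for an exact solver" strategy can be illustrated without external dependencies by brute‑forcing a miniature zebra‑style grid; the puzzle and clues below are invented for illustration, not taken from the thread:

```python
# Toy version of the "translation + external solver" approach: encode the
# clues as predicates and exhaustively search all assignments.
from itertools import permutations

COLORS = ("red", "green", "blue")
PETS = ("cat", "dog", "fish")

solutions = []
for colors in permutations(COLORS):      # colors[i] = color of house i (0-indexed)
    for pets in permutations(PETS):      # pets[i] = pet in house i
        if (
            colors.index("green") == colors.index("red") + 1       # green is just right of red
            and pets[colors.index("green")] == "fish"              # the fish is in the green house
            and pets[0] == "cat"                                   # the cat lives in house 1
            and abs(pets.index("dog") - colors.index("blue")) == 1 # dog owner is next to the blue house
        ):
            solutions.append((colors, pets))

print(solutions)  # -> [(('blue', 'red', 'green'), ('cat', 'dog', 'fish'))]
```

Generating this kind of program is exactly what the disputed Prolog/Z3 workflow does at larger scale: the LLM only has to formalize the clues correctly, and the search itself is then exact, which is the crux of the "does the LLM solve it?" disagreement.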

Pattern Matching vs Reasoning and Human Differences

  • Many frame transformers as powerful pattern matchers over internet text, not true reasoners; chain‑of‑thought is seen as forcing them to search their internal space more effectively.
  • Discussions highlight gaps with humans: embodiment and continuous real‑world feedback, selective and specialized learning, symbolic reasoning over explicit rules, and something akin to a “limbic system” or dynamic reward structure.
  • Training on noisy web data versus structured “textbook‑style” corpora is raised as a key limitation on deep, expert‑level reasoning.

Progress, Benchmarks, and Hype

  • Some claim that purported “fundamental limitations” keep getting erased by new models (e.g., o3’s strong ARC‑AGI scores), while practical user experience hasn’t changed as dramatically.
  • Others stress that the article’s main paper is from 2023 and evaluated GPT‑3/3.5/early‑4, now viewed as “ancient,” so its empirical claims shouldn’t be read as the current frontier.
  • There is broad agreement that formalizing LLM failure modes on compositional tasks is valuable, but disagreement on how much these results constrain next‑generation, tool‑augmented systems.