Seven replies to the viral Apple reasoning paper and why they fall short

AI Hype, Usefulness, and Reliability

  • Many see the article and Apple paper as needed “cold water” on the hype cycle: LLMs are impressive and useful, but heavily oversold, especially for critical or high‑reliability tasks.
  • Others argue current systems are already extraordinary general-purpose tools (chatting more intelligently than “90% of people”); critics respond that a tool that is right ~70% of the time is unusable in many domains (finance, mailroom, banking, law), as the illustrative calculation below makes concrete.
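  A rough, illustrative calculation behind the reliability objection, assuming independent errors and a workflow that needs every step to be correct (a simplification of real deployments, not a claim from the article or the paper):

    # Sketch: how ~70% per-step accuracy compounds across a multi-step
    # workflow, assuming independent errors (an illustrative assumption).
    per_step_accuracy = 0.70
    for steps in (1, 5, 10, 20):
        chance_all_correct = per_step_accuracy ** steps
        print(f"{steps} steps -> {chance_all_correct:.2%} chance of zero errors")
    # 1 steps -> 70.00%, 5 steps -> 16.81%, 10 steps -> 2.82%, 20 steps -> 0.08%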

Gary Marcus: Value of Critique vs Bias

  • Some commenters view Marcus as a necessary counterweight to AGI boosterism, consistently calling out hallucinations, safety issues, and hype.
  • Others see a long-standing, repetitive neurosymbolic agenda: dismissing deep learning, overstating failures, and never seriously engaging with LLMs’ practical successes.
  • Debate over ad hominem: whether Marcus's track record of past predictions is fair game, or whether commenters should engage only with the actual arguments in this piece.

Apple “Illusion of Thinking” Paper: Methods and Goals

  • Supporters say the paper shows reasoning models break down on algorithmic puzzles like the Tower of Hanoi, suggesting that apparent reasoning often relies on pattern recall.
  • Critics argue:
    • The central claim that the puzzles are truly “novel” is untestable, because outsiders can't know what was in the training set.
    • The setup forces models to reason entirely in-context, disallowing tools (code, search) that are central to practical use.
    • Models were scored as “wrong” even when they correctly stated the algorithm and gave partial move sequences but stopped before enumerating the tens of thousands of moves required (see the sketch after this list).
  • Some point to a separate arXiv rebuttal that directly challenges the Apple conclusions and note Marcus barely engages with it.
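  On the output-length point: an optimal Tower of Hanoi solution for n disks takes 2^n − 1 moves, so a fully enumerated answer grows exponentially with problem size. A minimal sketch of the standard recursive solver (disk counts chosen for illustration, not taken from the paper):

    # Standard recursive Tower of Hanoi solver; the move counts printed
    # below (2**n - 1) show why listing every move quickly becomes
    # impractical in-context. Disk counts are illustrative, not the
    # exact sizes used in the Apple paper.
    def hanoi(n, src="A", aux="B", dst="C", moves=None):
        if moves is None:
            moves = []
        if n > 0:
            hanoi(n - 1, src, dst, aux, moves)   # move n-1 disks out of the way
            moves.append((src, dst))             # move the largest disk
            hanoi(n - 1, aux, src, dst, moves)   # move n-1 disks back on top
        return moves

    for n in (3, 10, 15, 20):
        print(f"{n} disks -> {len(hanoi(n)):,} moves")
    # 3 disks -> 7 moves, 10 -> 1,023, 15 -> 32,767, 20 -> 1,048,575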

Do LLMs “Reason” or Just Memorize?

  • One camp: LLMs mostly repeat patterns seen in training; they fail badly on many genuinely novel or multi-step tasks, can’t signal ignorance reliably, and can’t yet replace an average unsupervised worker.
  • Another camp: they clearly exhibit some generalization and low‑grade reasoning (synthetic languages, puzzles like Monty Hall, ad‑hoc APIs, estimating “pianos at the bottom of the sea”), especially with tool use and larger models. Limitations exist, but capability is on a spectrum, not zero-or-AGI.

AGI Definitions and Expectations

  • Confusion between AGI (“human-level across most cognitive tasks”) and ASI (“superintelligence”) recurs.
  • Some argue matching average human performance (with human-like flaws) is enough for AGI; others insist on self-teaching, metacognition, and robust novel problem-solving.
  • Several see both hype (“AGI is imminent”) and anti‑hype (“LLMs are useless parrots”) as symmetrical extremes; real progress and real limits coexist.