Does Reasoning Emerge? Probabilities of Causation in Large Language Models

Benchmarking “True” Reasoning

  • The paper’s PN/PS (probability of necessity / probability of sufficiency) trick-question setup is seen as clever but fragile: once benchmarks are public, future models can be tuned to them, “studying for the test” rather than developing robust reasoning.
  • Some argue randomizing multiple‑choice options and question surface forms helps, but others note models can be retrained on randomized variants too.
  • A few see causal-probability tests as a promising engineering metric for simple inference and counterfactual robustness, but the current tasks (short Boolean chains) are considered too narrow to capture complex reasoning.
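
For readers unfamiliar with the two quantities, the following is a minimal sketch of how PN and PS can be computed exactly on a tiny Boolean model, using the standard counterfactual definitions PN = P(Y would be 0 had X been 0 | X=1, Y=1) and PS = P(Y would be 1 had X been 1 | X=0, Y=0). The model itself (Y = X OR U_y), its noise probabilities, and every function name are hypothetical and chosen only for illustration; this is not the paper's evaluation pipeline.

```python
from fractions import Fraction
from itertools import product

# Hypothetical toy structural model (an assumption, not taken from the paper):
#   X = U_x          with P(U_x = 1) = 1/2
#   Y = X OR U_y     with P(U_y = 1) = 1/4
P_UX = {0: Fraction(1, 2), 1: Fraction(1, 2)}
P_UY = {0: Fraction(3, 4), 1: Fraction(1, 4)}


def x_of(ux):
    """Structural equation for X."""
    return ux


def y_of(x, uy):
    """Structural equation for Y."""
    return int(x or uy)


def prob_of_causation(x_obs, y_obs, x_cf, y_cf):
    """P(Y would equal y_cf had X been x_cf | X = x_obs, Y = y_obs).

    Computed by enumerating the exogenous noise, keeping the worlds that
    match the evidence, and re-evaluating Y under the intervention.
    """
    numerator = Fraction(0)
    evidence = Fraction(0)
    for ux, uy in product(P_UX, P_UY):
        weight = P_UX[ux] * P_UY[uy]
        x = x_of(ux)
        if x == x_obs and y_of(x, uy) == y_obs:
            evidence += weight
            # Same noise values, but X is forced to the counterfactual value.
            if y_of(x_cf, uy) == y_cf:
                numerator += weight
    if evidence == 0:
        raise ValueError("evidence has zero probability in this model")
    return numerator / evidence


# Probability of necessity: X=1 and Y=1 were observed; would Y be 0 had X been 0?
PN = prob_of_causation(x_obs=1, y_obs=1, x_cf=0, y_cf=0)
# Probability of sufficiency: X=0 and Y=0 were observed; would Y be 1 had X been 1?
PS = prob_of_causation(x_obs=0, y_obs=0, x_cf=1, y_cf=1)

print("PN =", PN)  # PN = 3/4
print("PS =", PS)  # PS = 1
```

In this toy model X is always sufficient for Y (PS = 1) but only necessary when the alternative cause U_y is absent (PN = 3/4). A benchmark in this spirit would compare values like these against what a model's answers to factual and counterfactual questions imply.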

Simulation vs Real Reasoning and Consciousness

  • Several comments question whether there is any observable difference between “real” and “fake” reasoning; if outputs are sound and consistent, they ask, does the internal process matter?
  • Others argue it does matter: models that game benchmarks without deeper cognition will fail unpredictably in novel settings.
  • Long philosophical subthreads debate consciousness vs free will, simulation vs reality, and whether we could ever empirically distinguish “simulated” from “real” consciousness or reasoning.

Pattern Matching, Abstraction, and Causality

  • Many see LLMs as powerful pattern matchers that interpolate between seen examples but struggle with symbolic abstraction, counterfactuals, and transferring reasoning across domains.
  • The poor counterfactual performance reported in the paper is interpreted by some as evidence of an architectural ceiling for reasoning in current LLMs.
  • Others note humans also rely heavily on pattern matching and are easily fooled by classic trick problems, though humans can usually “move up a level” and correct themselves when prompted.

Human vs LLM Intelligence

  • Debate centers on whether exam performance (coding, legal, medical) implies comparable “general intelligence.”
    • Pro‑LLM side: models outperform many humans on standardized tests and can tackle a huge breadth of tasks.
    • Skeptical side: humans exhibit robust reasoning in novel, embodied, low‑data contexts (e.g., driving, spatial navigation, everyday planning) where LLMs fail or need massive training.
  • Differences in architecture are emphasized: brains support continuous online learning, long‑term attention, exploration, and rich sensory grounding; transformers mostly process token streams.

Practical Utility and ROI

  • Several comments claim current AI shows disappointing return on investment at scale and lacks “real” reasoning power; others counter that there is already substantial value in narrow, low‑/medium‑stakes tasks.
  • There is broad agreement that marketing has oversold “AGI‑like” capabilities relative to what LLMs actually deliver.

Architecture, Agents, and Future Directions

  • Some argue that true reasoning will require loops, sub‑systems, internal monitoring, and multi‑step planning (e.g., agent frameworks, “Id/Ego/Superego” style decompositions).
  • Others think scale and better training alone may continue to improve apparent reasoning, but there is no consensus on whether that will reach human‑level abstraction.