Does Reasoning Emerge? Probabilities of Causation in Large Language Models
Benchmarking “True” Reasoning
- The paper’s PN/PS (probability of necessity / probability of sufficiency) trick-question setup is seen as clever but fragile: once benchmarks are public, future models can be tuned to them, “studying for the test” rather than developing robust reasoning.
- Some argue randomizing multiple‑choice options and question surface forms helps, but others note models can be retrained on randomized variants too.
- A few see causal-probability tests as a promising engineering metric for simple inference and counterfactual robustness, but the current tasks (short Boolean chains) are considered too narrow to capture complex reasoning.
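For readers unfamiliar with the two quantities, here is a minimal sketch of how PN and PS can be computed by brute-force enumeration in a tiny Boolean model. It follows the standard counterfactual definitions (PN: given that X and Y both occurred, how likely is it that Y would not have occurred without X; PS: given that neither occurred, how likely is it that Y would have occurred with X). The model `Y = X or U`, the probabilities, and the function names are all illustrative assumptions, not the paper's actual tasks or code.

```python
from itertools import product

# Hypothetical toy model (an assumption for illustration only):
# X and U are independent binary variables, and Y = X or U.
P_X = {0: 0.5, 1: 0.5}
P_U = {0: 0.7, 1: 0.3}


def outcome(x, u):
    """Structural equation of the toy model: Y = X or U."""
    return int(x or u)


def probability_of_causation(x_obs, y_obs, x_cf, y_cf):
    """P(Y would equal y_cf had X been x_cf | X = x_obs, Y = y_obs).

    Computed by enumerating every (x, u) world, keeping those that match
    the observed evidence, and re-evaluating Y with X forced to x_cf
    while the background variable U stays fixed.
    """
    evidence = 0.0
    joint = 0.0
    for x, u in product(P_X, P_U):
        weight = P_X[x] * P_U[u]
        if x == x_obs and outcome(x, u) == y_obs:
            evidence += weight
            if outcome(x_cf, u) == y_cf:
                joint += weight
    if evidence == 0.0:
        raise ValueError("observed evidence has zero probability")
    return joint / evidence


# PN: X = 1 and Y = 1 were observed; would Y have been 0 had X been 0?
pn = probability_of_causation(x_obs=1, y_obs=1, x_cf=0, y_cf=0)

# PS: X = 0 and Y = 0 were observed; would Y have been 1 had X been 1?
ps = probability_of_causation(x_obs=0, y_obs=0, x_cf=1, y_cf=1)

print(f"PN = {pn:.3f}, PS = {ps:.3f}")
```

In this toy model PN reduces to P(U = 0), about 0.7, and PS is 1, because X alone is enough to produce Y. A benchmark in this spirit would compare such reference values against what a model's answers imply under factual and counterfactual phrasings of the same question.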
Simulation vs Real Reasoning and Consciousness
- Several comments question whether there is any observable difference between “real” and “fake” reasoning; if outputs are sound and consistent, they ask, does the internal process matter?
- Others argue it does matter: models that game benchmarks without deeper cognition will fail unpredictably in novel settings.
- Long philosophical subthreads debate consciousness vs free will, simulation vs reality, and whether we could ever empirically distinguish “simulated” from “real” consciousness or reasoning.
Pattern Matching, Abstraction, and Causality
- Many see LLMs as powerful pattern matchers that interpolate between seen examples but struggle with symbolic abstraction, counterfactuals, and transferring reasoning across domains.
- The paper’s low tolerance for counterfactual errors is interpreted by some as evidence of an architectural ceiling for reasoning in current LLMs.
- Others note humans also rely heavily on pattern matching and are easily fooled by classic trick problems, though humans can usually “move up a level” and correct themselves when prompted.
Human vs LLM Intelligence
- Debate centers on whether exam performance (coding, legal, medical) implies comparable “general intelligence.”
- Pro‑LLM side: models outperform many humans on standardized tests and can tackle a huge breadth of tasks.
- Skeptical side: humans exhibit robust reasoning in novel, embodied, low‑data contexts (e.g., driving, spatial navigation, everyday planning) where LLMs fail or need massive training.
- Differences in architecture are emphasized: brains support continuous online learning, long‑term attention, exploration, and rich sensory grounding; transformers mostly process token streams.
Practical Utility and ROI
- Several comments claim current AI shows disappointing return on investment at scale and lacks “real” reasoning power; others counter that there is already substantial value in narrow, low‑/medium‑stakes tasks.
- There is broad agreement that marketing has oversold “AGI‑like” capabilities relative to what LLMs actually deliver.
Architecture, Agents, and Future Directions
- Some argue that true reasoning will require loops, sub‑systems, internal monitoring, and multi‑step planning (e.g., agent frameworks, “Id/Ego/Superego” style decompositions).
- Others think scale and better training alone may continue to improve apparent reasoning, but there is no consensus on whether that will reach human‑level abstraction.