2024-08-16

Does Reasoning Emerge? Probabilities of Causation in Large Language Models

Benchmarking “True” Reasoning

The paper’s trick-question setup, built on the probability of necessity (PN) and the probability of sufficiency (PS), is seen as clever but fragile: once benchmarks are public, future models can be tuned to them, “studying for the test” rather than developing robust reasoning.
Some argue randomizing multiple‑choice options and question surface forms helps, but others note models can be retrained on randomized variants too.
A few see causal-probability tests as a promising engineering metric for simple inference and counterfactual robustness, but the current tasks (short Boolean chains) are considered too narrow to capture complex reasoning.
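For readers unfamiliar with the two quantities, here is a minimal sketch of how PN and PS can be computed by exact enumeration on a toy Boolean model. It uses the standard counterfactual definitions, PN = P(Y would be false had X been false | X true, Y true) and PS = P(Y would be true had X been true | X false, Y false). The model itself (`Y = X or U2`), the function names `y_of` and `pn_ps`, and the probabilities are illustrative assumptions, not the paper's tasks or code.

```python
from itertools import product

def y_of(x: bool, u2: bool) -> bool:
    # Hypothetical structural equation: Y holds if the cause X holds
    # or if an independent background cause U2 holds.
    return x or u2

def pn_ps(p_x: float = 0.5, p_u2: float = 0.25) -> tuple[float, float]:
    """Exact PN and PS for the toy model, by enumerating exogenous states."""
    pn_num = pn_den = ps_num = ps_den = 0.0
    for u1, u2 in product([False, True], repeat=2):
        # Weight of this exogenous state (U1 and U2 assumed independent).
        w = (p_x if u1 else 1 - p_x) * (p_u2 if u2 else 1 - p_u2)
        x = u1               # X is set directly by U1
        y = y_of(x, u2)      # factual outcome
        if x and y:
            pn_den += w
            # Counterfactual: same background state, but X forced to False.
            if not y_of(False, u2):
                pn_num += w
        if not x and not y:
            ps_den += w
            # Counterfactual: same background state, but X forced to True.
            if y_of(True, u2):
                ps_num += w
    return pn_num / pn_den, ps_num / ps_den

pn, ps = pn_ps()
print(f"PN = {pn:.2f}, PS = {ps:.2f}")
```

In this toy model X is always sufficient for Y (PS = 1) but not always necessary, because U2 can produce Y on its own (PN = 1 - p_u2). A benchmark in this spirit would compare a model's answers to factual and counterfactual questions against values like these, which is also why commenters worry that such narrow, enumerable setups are easy to train against.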

Simulation vs Real Reasoning and Consciousness

Several comments question whether there is any observable difference between “real” and “fake” reasoning; if outputs are sound and consistent, they ask, does the internal process matter?
Others argue it does matter: models that game benchmarks without deeper cognition will fail unpredictably in novel settings.
Long philosophical subthreads debate consciousness vs free will, simulation vs reality, and whether we could ever empirically distinguish “simulated” from “real” consciousness or reasoning.

Pattern Matching, Abstraction, and Causality

Many see LLMs as powerful pattern matchers that interpolate between seen examples but struggle with symbolic abstraction, counterfactuals, and transferring reasoning across domains.
The paper’s low tolerance for counterfactual errors is interpreted by some as evidence of an architectural ceiling for reasoning in current LLMs.
Others note humans also rely heavily on pattern matching and are easily fooled by classic trick problems, though humans can usually “move up a level” and correct themselves when prompted.

Human vs LLM Intelligence

Debate centers on whether exam performance (coding, legal, medical) implies comparable “general intelligence.”
- Pro‑LLM side: models outperform many humans on standardized tests and can tackle a huge breadth of tasks.
- Skeptical side: humans exhibit robust reasoning in novel, embodied, low‑data contexts (e.g., driving, spatial navigation, everyday planning) where LLMs fail or need massive training.
Differences in architecture are emphasized: brains support continuous online learning, long‑term attention, exploration, and rich sensory grounding; transformers mostly process token streams.

Practical Utility and ROI

Several comments claim current AI shows disappointing return on investment at scale and lacks “real” reasoning power; others counter that it already delivers substantial value in narrow, low‑ and medium‑stakes tasks.
There is broad agreement that marketing has oversold “AGI‑like” capabilities relative to what LLMs actually deliver.

Architecture, Agents, and Future Directions

Some argue that true reasoning will require loops, sub‑systems, internal monitoring, and multi‑step planning (e.g., agent frameworks, “Id/Ego/Superego” style decompositions).
Others think scale and better training alone may continue to improve apparent reasoning, but there is no consensus on whether that will reach human‑level abstraction.
