Procedural knowledge in pretraining drives reasoning in large language models

Procedural Knowledge vs. Retrieval

  • Core claim discussed: LLM reasoning traces on math problems seem driven more by procedural knowledge (step-by-step methods, formulas, code) than by memorized answers to identical questions.
  • Commenters read this as evidence of generalization over pure retrieval: models synthesize patterns for “how to solve” rather than simply looking up answers.
  • Some note this aligns with experiences that models often follow a reasoning path without self-correction; once on a path, backtracking is weak unless explicitly trained for it.

Memorization, Generalization, and “Reasoning”

  • Debate over whether this is “memorization at a higher level” or genuine generalization.
    • One view: shared weights force compression into patterns that generalize beyond seen examples.
    • Another view: it’s still fundamentally pattern extrapolation, not human-like reasoning.
  • Several distinguish “generalization” (pattern-based guessing) from “reasoning” (multi-step, flexible, with alternatives and backtracking), arguing LLMs do some of both but imperfectly.
  • Others argue that if models produce correct, novel step-by-step solutions beyond training examples, calling that “reasoning” is justified.

Role of Training Data (Code, Textbooks, Notes)

  • Participants link the findings to prior work showing benefits of mixing substantial code into training, especially for tasks needing long-range state tracking.
  • Some note that major models train on a substantial fraction of code, and that mixed text-and-code training can outperform training on either alone.
  • There is interest in training more on textbooks, proofs, student notes, and worked examples, with the idea that procedural content and corrections may especially help reasoning.
  • A separate thread connects this to pretraining for chip design, arguing strong pretraining is plausibly necessary for complex design reasoning.
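The data-mixing idea in the bullets above can be sketched as a weighted corpus sampler. This is a minimal illustration, not the recipe of any particular model: the corpus names and mixture weights below are assumptions chosen for the example.

```python
import random

def sample_mixture(corpora, weights, n, seed=0):
    """Draw n documents from named corpora according to fixed mixture weights.

    corpora: dict mapping corpus name -> list of documents
    weights: dict mapping corpus name -> sampling weight (need not sum to 1)
    Returns a list of (corpus_name, document) pairs.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    names = list(corpora)
    w = [weights[name] for name in names]
    batch = []
    for _ in range(n):
        name = rng.choices(names, weights=w)[0]   # pick a corpus by weight
        batch.append((name, rng.choice(corpora[name])))  # then a document from it
    return batch

# Illustrative 70/30 text/code mix -- the ratio is an assumption for the sketch.
corpora = {"text": ["essay_1", "essay_2"], "code": ["snippet_1", "snippet_2"]}
batch = sample_mixture(corpora, {"text": 0.7, "code": 0.3}, n=1000, seed=1)
```

Over many draws, roughly 30% of the sampled documents come from the code corpus, which is the kind of knob the mixed-training results above are varying.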

Human vs. LLM Reasoning and Reliability

  • Long meta-discussion compares human and LLM fallibility:
    • Humans are also unreliable and often on “autopilot,” yet bear responsibility and can be incentivized.
    • LLMs are powerful but opaque and hard to hold accountable; complexity, not nondeterminism per se, undermines responsibility.
  • Some object to the term “reasoning” as marketing language; others defend everyday anthropomorphic terms (“thinking,” “reasoning”) as convenient approximations.

Impact and Expectations

  • Several expect substantial economic and practical impact even from current imperfect models.
  • Others stress that hype outpaces reliability, and that LLMs may be best used as a natural-language front-end to more formal tools (code, solvers) rather than as standalone reasoners.
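The front-end pattern in the last bullet can be made concrete with a toy sketch: the model's only job is to translate prose into a structured call, and a small exact solver does the arithmetic. The helper `solve_linear` below is a hypothetical stand-in for a formal tool, using only the standard library.

```python
from fractions import Fraction

def solve_linear(a: int, b: int) -> Fraction:
    """Solve a*x + b = 0 exactly; reject degenerate equations."""
    if a == 0:
        raise ValueError("coefficient of x must be nonzero")
    return Fraction(-b, a)

# A hypothetical LLM front-end would map the prose
# "twice a number plus six equals zero" to the structured call below;
# the formal tool, not the model, carries the arithmetic, so the answer is checkable.
print(solve_linear(2, 6))  # -3
```

The division of labor is the point: the model handles the fuzzy natural-language step, while correctness rests on a deterministic tool whose output can be verified independently.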