Procedural knowledge in pretraining drives reasoning in large language models
Procedural Knowledge vs. Retrieval
- Core claim discussed: LLM reasoning traces on math problems seem driven more by procedural knowledge (step-by-step methods, formulas, code) than by memorized answers to identical questions.
- Commenters emphasize this as evidence of generalization over pure retrieval: models synthesize patterns for “how to solve” rather than just lookup.
- Some note this aligns with the experience that models tend to follow a single reasoning path without self-correcting; once committed to a path, they rarely backtrack unless explicitly trained to do so.
Memorization, Generalization, and “Reasoning”
- Debate over whether this is “memorization at a higher level” or genuine generalization.
- One view: shared weights force compression into patterns that generalize beyond seen examples.
- Another view: it’s still fundamentally pattern extrapolation, not human-like reasoning.
- Several distinguish “generalization” (pattern-based guessing) from “reasoning” (multi-step, flexible, with alternatives and backtracking), arguing LLMs do some of both but imperfectly.
- Others argue that if models produce correct, novel step-by-step solutions beyond training examples, calling that “reasoning” is justified.
Role of Training Data (Code, Textbooks, Notes)
- Participants link the findings to prior work showing benefits of mixing substantial code into training, especially for tasks needing long-range state tracking.
- Some note that major models train on a significant percentage of code, and that mixed text-and-code training can outperform training on either alone.
- There is interest in training more on textbooks, proofs, student notes, and worked examples, with the idea that procedural content and corrections may especially help reasoning.
- A separate thread connects this to pretraining for chip design, arguing strong pretraining is plausibly necessary for complex design reasoning.
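The data-mixing idea above can be sketched as a sampler that interleaves text and code documents at a target code fraction. This is a minimal illustration of the setup being discussed, not any particular model's pipeline; the corpora and the `mix_batches` helper are hypothetical.

```python
import random

def mix_batches(text_docs, code_docs, code_fraction, n, seed=0):
    """Sample a pretraining stream that mixes text and code documents
    at a target code fraction (a common data-mixing setup)."""
    rng = random.Random(seed)
    stream = []
    for _ in range(n):
        if rng.random() < code_fraction:
            stream.append(("code", rng.choice(code_docs)))
        else:
            stream.append(("text", rng.choice(text_docs)))
    return stream

# Toy corpora standing in for real pretraining shards.
text = ["an essay", "a textbook chapter", "a forum post"]
code = ["def f(x): ...", "fn main() {}"]

stream = mix_batches(text, code, code_fraction=0.25, n=1000)
observed = sum(1 for kind, _ in stream if kind == "code") / len(stream)
print(round(observed, 2))  # close to the 0.25 target
```

In real pipelines the fraction is tuned empirically; the point of the discussion is that a nonzero code share appears to help tasks needing long-range state tracking, not that any specific ratio is optimal.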
Human vs. LLM Reasoning and Reliability
- Long meta-discussion compares human and LLM fallibility:
  - Humans are also unreliable and often on “autopilot,” yet they bear responsibility and can be incentivized.
  - LLMs are powerful but opaque and hard to hold accountable; complexity, not nondeterminism per se, undermines responsibility.
- Some object to the term “reasoning” as marketing language; others defend everyday anthropomorphic terms (“thinking,” “reasoning”) as convenient approximations.
Impact and Expectations
- Several expect substantial economic and practical impact even from current imperfect models.
- Others stress that hype outpaces reliability, and that LLMs may be best used as a natural-language front-end to more formal tools (code, solvers) rather than as standalone reasoners.
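The "front-end to formal tools" pattern above can be sketched as: the LLM translates a natural-language question into a formal expression, and an exact evaluator, not token prediction, produces the answer. The `nl_to_expression` function below is a hypothetical stand-in for an LLM call (a real system would prompt a model); the evaluator uses only the Python standard library.

```python
import ast
import operator

def nl_to_expression(question):
    """Hypothetical stand-in for an LLM call that maps a natural-language
    question to a formal arithmetic expression."""
    translations = {
        "what is 12% of 250?": "250 * 12 / 100",
        "sum of 3, 5, and 7?": "3 + 5 + 7",
    }
    return translations[question.lower()]

# The "formal tool": a tiny evaluator that only allows arithmetic,
# so the final answer comes from exact computation.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def safe_eval(expr):
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("disallowed expression")
    return walk(ast.parse(expr, mode="eval").body)

print(safe_eval(nl_to_expression("What is 12% of 250?")))  # 30.0
```

The design point matches the commenters' argument: the model handles the fuzzy language-to-formalism step, while correctness of the final result rests on the deterministic tool.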