Procedural knowledge in pretraining drives reasoning in large language models

Procedural Knowledge vs. Retrieval

  • Core claim discussed: LLM reasoning traces on math problems seem driven more by procedural knowledge (step-by-step methods, formulas, code) than by memorized answers to identical questions.
  • Commenters read this as evidence of generalization over pure retrieval: models synthesize patterns for “how to solve” rather than simply looking up answers.
  • Some note this aligns with experiences that models often follow a reasoning path without self-correction; once on a path, backtracking is weak unless explicitly trained for it.

Memorization, Generalization, and “Reasoning”

  • Debate over whether this is “memorization at a higher level” or genuine generalization.
    • One view: shared weights force compression into patterns that generalize beyond seen examples.
    • Another view: it’s still fundamentally pattern extrapolation, not human-like reasoning.
  • Several distinguish “generalization” (pattern-based guessing) from “reasoning” (multi-step, flexible, with alternatives and backtracking), arguing LLMs do some of both but imperfectly.
  • Others argue that if models produce correct, novel step-by-step solutions beyond training examples, calling that “reasoning” is justified.

Role of Training Data (Code, Textbooks, Notes)

  • Participants link the findings to prior work showing benefits of mixing substantial code into training, especially for tasks needing long-range state tracking.
  • Some note that major models train on a substantial fraction of code, and that mixed text-and-code training can outperform training on either alone.
  • There is interest in training more on textbooks, proofs, student notes, and worked examples, with the idea that procedural content and corrections may especially help reasoning.
  • A separate thread connects this to pretraining for chip design, arguing strong pretraining is plausibly necessary for complex design reasoning.
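The data-mixing idea in the bullets above can be sketched as a weighted corpus sampler. This is a minimal illustration, not the recipe of any particular model: the corpus names and mixture weights below are assumptions chosen for the example.

```python
import random

def sample_mixture(corpora, weights, n, seed=0):
    """Draw n documents from named corpora according to fixed mixture weights.

    corpora: dict mapping corpus name -> list of documents
    weights: dict mapping corpus name -> sampling weight (need not sum to 1)
    Returns a list of (corpus_name, document) pairs.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    names = list(corpora)
    w = [weights[name] for name in names]
    batch = []
    for _ in range(n):
        name = rng.choices(names, weights=w)[0]   # pick a corpus by weight
        batch.append((name, rng.choice(corpora[name])))  # then a document from it
    return batch

# Illustrative 70/30 text/code mix -- the ratio is an assumption for the sketch.
corpora = {"text": ["essay_1", "essay_2"], "code": ["snippet_1", "snippet_2"]}
batch = sample_mixture(corpora, {"text": 0.7, "code": 0.3}, n=1000, seed=1)
```

Over many draws, roughly 30% of the sampled documents come from the code corpus, which is the kind of knob the mixed-training results above are varying.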

Human vs. LLM Reasoning and Reliability

  • Long meta-discussion compares human and LLM fallibility:
    • Humans are also unreliable and often on “autopilot,” yet bear responsibility and can be incentivized.
    • LLMs are powerful but opaque and hard to hold accountable; complexity, not nondeterminism per se, undermines responsibility.
  • Some object to the term “reasoning” as marketing language; others defend everyday anthropomorphic terms (“thinking,” “reasoning”) as convenient approximations.

Impact and Expectations

  • Several expect substantial economic and practical impact even from current imperfect models.
  • Others stress that hype outpaces reliability, and that LLMs may be best used as a natural-language front-end to more formal tools (code, solvers) rather than as standalone reasoners.
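The front-end pattern in the last bullet can be made concrete with a toy sketch: the model's only job is to translate prose into a structured call, and a small exact solver does the arithmetic. The helper `solve_linear` below is a hypothetical stand-in for a formal tool, using only the standard library.

```python
from fractions import Fraction

def solve_linear(a: int, b: int) -> Fraction:
    """Solve a*x + b = 0 exactly; reject degenerate equations."""
    if a == 0:
        raise ValueError("coefficient of x must be nonzero")
    return Fraction(-b, a)

# A hypothetical LLM front-end would map the prose
# "twice a number plus six equals zero" to the structured call below;
# the formal tool, not the model, carries the arithmetic, so the answer is checkable.
print(solve_linear(2, 6))  # -3
```

The division of labor is the point: the model handles the fuzzy natural-language step, while correctness rests on a deterministic tool whose output can be verified independently.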