Emerging reasoning with reinforcement learning

What the paper/post is doing

  • Describes reproducing DeepSeek R1–style “reasoning via RL” on a small (~7B) math model with ~8k problems.
  • Uses simple reinforcement learning: the reward depends only on final-answer correctness, with no explicit supervision of intermediate steps.
  • Result: chain-of-thought (CoT)–style reasoning “emerges” in a model that previously did not show it, but overall capability is still limited by small size.
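The outcome-only reward above can be sketched as a verifier that checks just the final answer and gives no credit for intermediate steps. This is a minimal illustration, not the paper's actual code; the regex-based answer extraction and the function name are assumptions.

```python
import re

def outcome_reward(completion: str, ground_truth: str) -> float:
    """Return 1.0 only if the last number in the completion matches
    the ground-truth answer; intermediate steps earn nothing.
    (Hypothetical sketch; assumes answers are plain numbers.)"""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == ground_truth else 0.0
```

For example, outcome_reward("First add 3 and 4 to get 7, so the answer is 7", "7") returns 1.0, while a completion with a wrong final number, or no number at all, returns 0.0. The model only ever sees this scalar signal, which is what makes the emergence of step-by-step traces notable.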

Chain-of-thought and why it helps

  • CoT = having the model “think out loud” step by step before answering.
  • Previously taught mainly via supervised fine-tuning on hand-written reasoning traces, which are expensive.
  • Discussion of why CoT works:
    • More output tokens mean more forward passes, i.e., more computation spent per answer.
    • Breaking problems into smaller substeps makes search in solution space easier.
    • RL can nudge models to take longer, more cautious paths when harder tasks require many small corrective steps.
  • Some argue this is best seen as iterative search through latent space rather than “human-like reasoning.”
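A toy way to see the "smaller substeps make search easier" point: guessing a 4-digit code all at once versus digit by digit with intermediate feedback. This is an illustrative analogy only, not an example from the thread.

```python
# Toy illustration: decomposition shrinks the search space.
# Checking only the full answer (outcome-only) can take up to 10**4
# tries; verifying each digit separately needs at most 4 * 10.

SECRET = "7305"  # arbitrary target for the demo

def guess_all_at_once(secret: str) -> int:
    """Count tries when only the complete answer can be checked."""
    tries = 0
    for n in range(10 ** len(secret)):
        tries += 1
        if str(n).zfill(len(secret)) == secret:
            break
    return tries

def guess_stepwise(secret: str) -> int:
    """Count tries when each digit (substep) can be verified separately."""
    tries = 0
    for digit in secret:
        for d in "0123456789":
            tries += 1
            if d == digit:
                break
    return tries

print(guess_all_at_once(SECRET), guess_stepwise(SECRET))  # prints: 7306 19
```

The exponential-versus-linear gap is the intuition behind CoT: each written step is a checkpoint that prunes the remaining search, which is also why some commenters prefer to describe the behavior as iterative search rather than reasoning.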

RL vs SFT vs distillation

  • DeepSeek’s own paper emphasizes distilling reasoning patterns from a large RL-trained model into smaller ones via SFT; they did not RL-train the small distilled models.
  • Debate:
    • One side: distillation from a stronger reasoning model beats doing RL directly on small models.
    • Other side: this work shows small models can learn CoT via RL alone; question becomes how small, how cheap, and on what tasks.
  • Concerns: RL is compute-heavy compared to SFT at equal data; RL-tuned models can become “stubborn” and ignore prompts outside their reward-shaped niche.
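What "distilling reasoning via SFT" amounts to in practice can be sketched as building a supervised dataset from teacher-generated traces. The helper names below are hypothetical, and the filter-by-correctness step is a common recipe rather than necessarily DeepSeek's exact one.

```python
def build_distillation_dataset(problems, teacher_generate, is_correct):
    """Sample a CoT trace from the (larger, RL-trained) teacher for each
    problem, keep only traces whose final answer checks out, and return
    (prompt -> trace) pairs for supervised fine-tuning of the student."""
    dataset = []
    for problem, answer in problems:
        trace = teacher_generate(problem)   # teacher's CoT + final answer
        if is_correct(trace, answer):       # discard wrong traces
            dataset.append({"prompt": problem, "target": trace})
    return dataset
```

The student never runs RL here; it just imitates filtered teacher text, which is why distillation is far cheaper per example and why the debate centers on whether small models need their own RL at all.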

Does this show “real reasoning”?

  • Enthusiasts: emergent CoT and self-correction on hard math are strong evidence of genuine reasoning, undermining the “stochastic parrots / mere regurgitation” view.
  • Skeptics:
    • Argue it’s still token-level pattern generation, akin to structured search or calculators plus fuzzy lookup.
    • Note lack of embodiment, motivation, episodic memory, and continuous online learning.
    • Emphasize that words like “reason,” “emergent,” “intelligent” are being stretched; much of the debate is about definitions.

Broader implications and open questions

  • If RL can cheaply boost reasoning on any capable base model, “reasoning models” may become a commodity, eroding proprietary moats.
  • Potential to apply similar RL setups to non-math domains (code, science, finance, medicine), but this is speculative in the thread.
  • Open questions raised:
    • How to design reward functions that reliably elicit desired emergent behaviors.
    • Whether similar methods power proprietary models (o1/o3, Gemini “thinking” variants) — currently unclear.
    • How far this moves systems toward general-purpose reasoning and whether AGI is on the visible trendline.