Emerging reasoning with reinforcement learning
What the paper/post is doing
- Describes reproducing DeepSeek R1–style “reasoning via RL” on a small (~7B) math model with ~8k problems.
- Uses simple reinforcement learning: reward only on final correctness, no explicit supervision of intermediate steps.
- Result: chain-of-thought (CoT)–style reasoning “emerges” in a model that previously did not show it, but overall capability is still limited by small size.
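The reward setup described above can be sketched minimally: a binary signal on the final answer only, with intermediate reasoning tokens never graded. The `\boxed{...}` extraction convention and function names here are illustrative assumptions, not the paper's actual code.

```python
import re

def correctness_reward(completion: str, gold_answer: str) -> float:
    """Binary reward on final correctness only: no supervision of
    intermediate steps. Assumes (hypothetically) that the model is
    prompted to put its final answer inside \\boxed{...}, a common
    convention on math benchmarks."""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0  # no parsable final answer
    predicted = match.group(1).strip()
    return 1.0 if predicted == gold_answer.strip() else 0.0

# An RL loop (e.g. PPO- or GRPO-style) would then maximize this reward
# over sampled completions; the chain-of-thought itself is unrewarded.
```

The point of the sketch is what is *absent*: nothing scores the reasoning trace, so any CoT behavior that appears is shaped only indirectly, via its effect on final-answer accuracy.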
Chain-of-thought and why it helps
- CoT = having the model “think out loud” step by step before answering.
- Previously taught mainly via supervised fine-tuning on hand-written reasoning traces, which are expensive.
- Discussion of why CoT works:
  - More tokens mean more forward passes, so the model spends more sequential computation per answer.
  - Breaking a problem into smaller substeps makes search through the solution space easier.
  - RL can nudge models toward longer, more cautious paths when harder tasks require many small corrective steps.
  - Some argue this is best seen as iterative search through latent space rather than “human-like reasoning.”
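The contrast between direct answering and CoT prompting can be made concrete. The prompt templates below are hypothetical illustrations of the two regimes, not anything from the paper:

```python
def direct_prompt(question: str) -> str:
    # Direct answering: the model must map question -> answer with
    # essentially no intermediate computation exposed in tokens.
    return f"{question}\nAnswer:"

def cot_prompt(question: str) -> str:
    # Chain-of-thought: the instruction buys the model extra tokens,
    # i.e. extra sequential compute and room for substeps and
    # self-correction, before it commits to a final answer.
    return (
        f"{question}\n"
        "Think step by step, breaking the problem into smaller "
        "substeps, then state the final answer.\n"
        "Reasoning:"
    )
```

The "more tokens = more computation" argument falls out directly: every extra generated token is another forward pass the model gets to spend on the problem.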
RL vs SFT vs distillation
- DeepSeek’s own paper emphasizes distilling reasoning patterns from a large RL-trained model into smaller ones via SFT; they did not RL-train the small distilled models.
- Debate:
  - One side: distillation from a stronger reasoning model beats doing RL directly on small models.
  - Other side: this work shows small models can learn CoT via RL alone; the question becomes how small, how cheap, and on which tasks.
- Concerns: RL is compute-heavy compared to SFT at equal data; RL-tuned models can become “stubborn” and ignore prompts outside their reward-shaped niche.
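The distillation pipeline being contrasted with direct RL can be sketched as follows. Here `teacher_generate` stands in for sampling from the large RL-trained model and `is_correct` for an answer checker; both names are hypothetical, and the correctness filter is one common variant (rejection sampling), not necessarily the paper's exact recipe:

```python
from typing import Callable

def build_distillation_sft_dataset(
    problems: list[str],
    gold_answers: list[str],
    teacher_generate: Callable[[str], str],
    is_correct: Callable[[str, str], bool],
) -> list[dict]:
    """Distillation: sample reasoning traces from a strong teacher,
    keep the ones whose final answer checks out, and use them as
    plain SFT targets for a small student. The student itself is
    never RL-trained, only supervised on teacher traces."""
    dataset = []
    for problem, gold in zip(problems, gold_answers):
        trace = teacher_generate(problem)   # CoT + final answer
        if is_correct(trace, gold):         # optional correctness filter
            dataset.append({"prompt": problem, "completion": trace})
    return dataset
```

This also makes the cost asymmetry in the concerns above visible: SFT on such a dataset is one supervised pass, while RL must repeatedly sample, score, and update, which is far more compute per example.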
Does this show “real reasoning”?
- Enthusiasts: emergent CoT and self-correction on hard math are strong evidence of genuine reasoning, undermining the “stochastic parrots / mere regurgitation” view.
- Skeptics:
  - Argue it’s still token-level pattern generation, akin to structured search or a calculator plus fuzzy lookup.
  - Note the lack of embodiment, motivation, episodic memory, and continuous online learning.
  - Emphasize that words like “reason,” “emergent,” and “intelligent” are being stretched; much of the debate is about definitions.
Broader implications and open questions
- If RL can cheaply boost reasoning on any capable base model, “reasoning models” may become a commodity, eroding proprietary moats.
- Potential to apply similar RL setups to non-math domains (code, science, finance, medicine), but this is speculative in the thread.
- Open questions raised:
  - How to design reward functions that reliably elicit desired emergent behaviors.
  - Whether similar methods power proprietary models (o1/o3, Gemini “thinking” variants); this is currently unclear.
  - How far this moves systems toward general-purpose reasoning, and whether AGI is on the visible trendline.
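On the reward-design question, one known approach (used in rule-based form in the DeepSeek R1 paper) combines an answer-correctness reward with a format reward for a `<think>...</think><answer>...</answer>` template. The weighting and exact scoring below are illustrative assumptions, not the paper's values:

```python
import re

def format_reward(completion: str) -> float:
    # Rewards following the <think>...</think><answer>...</answer>
    # template; the tag scheme follows the R1 paper's format reward,
    # but this exact check is a toy version.
    ok = re.fullmatch(
        r"(?s)\s*<think>.*</think>\s*<answer>.*</answer>\s*", completion
    ) is not None
    return 1.0 if ok else 0.0

def answer_reward(completion: str, gold: str) -> float:
    m = re.search(r"(?s)<answer>(.*?)</answer>", completion)
    return 1.0 if m and m.group(1).strip() == gold else 0.0

def total_reward(completion: str, gold: str,
                 w_format: float = 0.2, w_answer: float = 1.0) -> float:
    # Shaped rewards like this can reliably elicit the desired output
    # structure, but the weights are design choices: too much format
    # reward invites reward hacking (well-formed, wrong answers).
    return (w_format * format_reward(completion)
            + w_answer * answer_reward(completion, gold))
```

The open question is precisely that there is no principled recipe here: which behaviors emerge depends on hand-tuned terms like these, and small changes can shift what the policy learns to optimize.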