Emerging reasoning with reinforcement learning

What the paper/post is doing

  • Describes reproducing DeepSeek R1–style “reasoning via RL” on a small (~7B) math model with ~8k problems.
  • Uses simple reinforcement learning: the reward depends only on final-answer correctness, with no explicit supervision of intermediate steps.
  • Result: chain-of-thought (CoT)–style reasoning “emerges” in a model that previously did not show it, but overall capability is still limited by small size.
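The outcome-only reward above can be sketched as a verifier that checks just the final answer and gives no credit for intermediate steps. This is a minimal illustration, not the paper's actual code; the regex-based answer extraction and the function name are assumptions.

```python
import re

def outcome_reward(completion: str, ground_truth: str) -> float:
    """Return 1.0 only if the last number in the completion matches
    the ground-truth answer; intermediate steps earn nothing.
    (Hypothetical sketch; assumes answers are plain numbers.)"""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == ground_truth else 0.0
```

For example, outcome_reward("First add 3 and 4 to get 7, so the answer is 7", "7") returns 1.0, while a completion with a wrong final number, or no number at all, returns 0.0. The model only ever sees this scalar signal, which is what makes the emergence of step-by-step traces notable.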

Chain-of-thought and why it helps

  • CoT = having the model “think out loud” step by step before answering.
  • Previously taught mainly via supervised fine-tuning on hand-written reasoning traces, which are expensive.
  • Discussion of why CoT works:
    • More output tokens mean more forward passes, i.e., more computation spent per answer.
    • Breaking problems into smaller substeps makes search in solution space easier.
    • RL can nudge models to take longer, more cautious paths when harder tasks require many small corrective steps.
  • Some argue this is best seen as iterative search through latent space rather than “human-like reasoning.”
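A toy way to see the "smaller substeps make search easier" point: guessing a 4-digit code all at once versus digit by digit with intermediate feedback. This is an illustrative analogy only, not an example from the thread.

```python
# Toy illustration: decomposition shrinks the search space.
# Checking only the full answer (outcome-only) can take up to 10**4
# tries; verifying each digit separately needs at most 4 * 10.

SECRET = "7305"  # arbitrary target for the demo

def guess_all_at_once(secret: str) -> int:
    """Count tries when only the complete answer can be checked."""
    tries = 0
    for n in range(10 ** len(secret)):
        tries += 1
        if str(n).zfill(len(secret)) == secret:
            break
    return tries

def guess_stepwise(secret: str) -> int:
    """Count tries when each digit (substep) can be verified separately."""
    tries = 0
    for digit in secret:
        for d in "0123456789":
            tries += 1
            if d == digit:
                break
    return tries

print(guess_all_at_once(SECRET), guess_stepwise(SECRET))  # prints: 7306 19
```

The exponential-versus-linear gap is the intuition behind CoT: each written step is a checkpoint that prunes the remaining search, which is also why some commenters prefer to describe the behavior as iterative search rather than reasoning.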

RL vs SFT vs distillation

  • DeepSeek’s own paper emphasizes distilling reasoning patterns from a large RL-trained model into smaller ones via SFT; they did not RL-train the small distilled models.
  • Debate:
    • One side: distillation from a stronger reasoning model beats doing RL directly on small models.
    • Other side: this work shows small models can learn CoT via RL alone; question becomes how small, how cheap, and on what tasks.
  • Concerns: RL is compute-heavy compared to SFT at equal data; RL-tuned models can become “stubborn” and ignore prompts outside their reward-shaped niche.
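What "distilling reasoning via SFT" amounts to in practice can be sketched as building a supervised dataset from teacher-generated traces. The helper names below are hypothetical, and the filter-by-correctness step is a common recipe rather than necessarily DeepSeek's exact one.

```python
def build_distillation_dataset(problems, teacher_generate, is_correct):
    """Sample a CoT trace from the (larger, RL-trained) teacher for each
    problem, keep only traces whose final answer checks out, and return
    (prompt -> trace) pairs for supervised fine-tuning of the student."""
    dataset = []
    for problem, answer in problems:
        trace = teacher_generate(problem)   # teacher's CoT + final answer
        if is_correct(trace, answer):       # discard wrong traces
            dataset.append({"prompt": problem, "target": trace})
    return dataset
```

The student never runs RL here; it just imitates filtered teacher text, which is why distillation is far cheaper per example and why the debate centers on whether small models need their own RL at all.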

Does this show “real reasoning”?

  • Enthusiasts: emergent CoT and self-correction on hard math are strong evidence of genuine reasoning, undermining the “stochastic parrots / mere regurgitation” view.
  • Skeptics:
    • Argue it’s still token-level pattern generation, akin to structured search or calculators plus fuzzy lookup.
    • Note lack of embodiment, motivation, episodic memory, and continuous online learning.
    • Emphasize that words like “reason,” “emergent,” “intelligent” are being stretched; much of the debate is about definitions.

Broader implications and open questions

  • If RL can cheaply boost reasoning on any capable base model, “reasoning models” may become a commodity, eroding proprietary moats.
  • Potential to apply similar RL setups to non-math domains (code, science, finance, medicine), but this is speculative in the thread.
  • Open questions raised:
    • How to design reward functions that reliably elicit desired emergent behaviors.
    • Whether similar methods power proprietary models (o1/o3, Gemini “thinking” variants) — currently unclear.
    • How far this moves systems toward general-purpose reasoning and whether AGI is on the visible trendline.