DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL

Training approach and data sources

  • Thread emphasizes that DeepScaleR fine-tunes an existing 1.5B model (a DeepSeek-R1 distillation of Alibaba's Qwen) with RL, rather than training from scratch on web-scale data.
  • Several comments note the shift from “crawl everything” to selective, higher-quality data and synthetic data (models training on conversations with stronger models).
  • RL phase is described as data-efficient: strong reasoning gains from relatively small curated datasets.
  • Some see this as evidence that open-source efforts can compete with “big boys” without replicating full-internet scraping.
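The data-efficient RL phase described above typically rests on verifiable rewards: each curated problem carries a known answer, and a rollout scores 1 only if its final answer matches. DeepScaleR's actual reward code is not shown in the thread; the sketch below is an illustrative, minimal version of that setup, and the `extract_boxed`/`outcome_reward` names are made up here.

```python
# Illustrative sketch (not DeepScaleR's actual code) of the binary
# outcome reward commonly used for RL on curated math problems.
import re

def extract_boxed(completion):
    """Pull the last \\boxed{...} answer out of a chain-of-thought rollout."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
    return matches[-1].strip() if matches else None

def outcome_reward(completion, gold_answer):
    """Binary outcome reward: 1.0 on an exact final-answer match, else 0.0."""
    return 1.0 if extract_boxed(completion) == gold_answer else 0.0

outcome_reward("... so the sum is \\boxed{204}.", "204")  # -> 1.0
outcome_reward("I am not sure.", "204")                   # -> 0.0
```

Because the signal is a simple match against a trusted answer key, a relatively small curated dataset can still drive strong reasoning gains, which is the data-efficiency point commenters highlight.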

Capabilities: strong at math, weak elsewhere

  • Consensus: this is a specialist math/reasoning model, not a generalist competitor to o1-preview.
  • Users report good performance on non-trivial math puzzles (e.g., sums of cubes, medical board-style questions, jug puzzles), and note that it "overthinks" even trivial prompts like 1+1.
  • At the same time, multiple testers find it "pretty stupid" outside math: it fails ASCII decoding, struggles with basic coding, misremembers algorithms, and behaves like a "high school math homework solver" only.

Quantization, tokenization, and small-model fragility

  • Experiments via Ollama show bizarre behavior on the “count Rs in ‘strawberry’” prompt, with the model hallucinating letter sequences like “strawfurber.”
  • That bug persists even in an FP32 GGUF conversion but disappears at F16/bfloat16; the authors say small models are highly sensitive to quantization and recommend running in bfloat16.
  • Discussion speculates about tokenization issues and hints at possible exploitable quirks.
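The tokenization speculation has a standard intuition behind it: subword tokenizers hand the model multi-character chunks, so it never directly "sees" individual letters. The toy sketch below uses a made-up split of "strawberry" (not DeepScaleR's real tokenizer) purely to illustrate why letter-counting is opaque to the model but trivial in code.

```python
# Toy illustration of why subword tokenization makes letter-counting
# hard for LLMs. The split below is invented for illustration; a real
# BPE tokenizer learns its merges from data.

def toy_bpe_tokenize(word):
    """Hypothetical subword split, loosely mimicking a BPE tokenizer."""
    splits = {"strawberry": ["str", "aw", "berry"]}
    return splits.get(word, [word])

word = "strawberry"
tokens = toy_bpe_tokenize(word)
print(tokens)           # ['str', 'aw', 'berry'] -- what the model consumes
print(word.count("r"))  # 3 -- trivial at the character level
```

The model operates on the token IDs for those chunks, so counting the Rs requires it to have memorized the spelling of each chunk, which small or aggressively quantized models get wrong.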

Benchmarks, overfitting, and trust

  • Some praise: beating o1-preview on math benchmarks with an RL run costing ~$4,500 (claimed ~18× cheaper than DeepSeek-R1's) is seen as nontrivial and exciting for edge devices.
  • Others argue this is likely “overfitting to evals”: fine-tuning narrowly on public math benchmarks says little about general capability.
  • Concerns that AIME and similar benchmarks have leaked online; broader skepticism that static benchmarks are too easy to game.
  • Suggestions include dynamic/parametric benchmarks and more human evals.
  • A later comment claims the competing rStar-Math model is misreported and actually outperforms DeepScaleR on multiple math sets, implying potential errors or cherry-picking.

Specialist vs generalist, tools, and broader implications

  • Several comments foresee many small, specialized models coordinated by a generalist “orchestrator” (Mixture-of-Models).
  • Others argue broad-and-deep generalists remain crucial for creative, cross-domain work.
  • There is interest in combining chain-of-thought models with calculators, code interpreters, and search tools, and in training models to “think by tool calls.”
  • Many see open-source and small RL-tuned models as rapidly advancing, with particular promise for on-device/edge AI.
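The "think by tool calls" idea above can be sketched as a harness that scans the model's chain of thought for a tool marker, runs the tool, and splices the result back in. The `CALC(...)` marker syntax and the harness below are hypothetical, not any specific model's protocol; the calculator walks a restricted AST rather than calling `eval()`.

```python
# Minimal sketch of a tool-call harness: replace every CALC(...) marker
# in model output with the value computed by a safe arithmetic evaluator.
# Marker syntax and function names are illustrative only.
import ast
import operator
import re

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def safe_eval(expr):
    """Evaluate +, -, *, / arithmetic by walking the AST (no eval())."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval").body)

def run_tool_calls(text):
    """Substitute each CALC(...) marker with its computed value."""
    return re.sub(r"CALC\(([^)]*)\)",
                  lambda m: str(safe_eval(m.group(1))), text)

run_tool_calls("The total is CALC(12*12+1).")  # -> 'The total is 145.'
```

Offloading exact arithmetic this way is one route around the weaknesses the thread notes in small chain-of-thought models, without growing the model itself.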