DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL

Training approach and data sources

  • Thread emphasizes that DeepScaleR fine-tunes an existing 1.5B model (a DeepSeek-R1 distillation of Alibaba's Qwen) with RL, rather than training from scratch on web-scale data.
  • Several comments note the shift from “crawl everything” to selective, higher-quality data and synthetic data (models training on conversations with stronger models).
  • RL phase is described as data-efficient: strong reasoning gains from relatively small curated datasets.
  • Some see this as evidence that open-source efforts can compete with “big boys” without replicating full-internet scraping.
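The data-efficient RL phase described above typically rests on verifiable rewards: each curated problem carries a known answer, and a rollout scores 1 only if its final answer matches. DeepScaleR's actual reward code is not shown in the thread; the sketch below is an illustrative, minimal version of that setup, and the `extract_boxed`/`outcome_reward` names are made up here.

```python
# Illustrative sketch (not DeepScaleR's actual code) of the binary
# outcome reward commonly used for RL on curated math problems.
import re

def extract_boxed(completion):
    """Pull the last \\boxed{...} answer out of a chain-of-thought rollout."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
    return matches[-1].strip() if matches else None

def outcome_reward(completion, gold_answer):
    """Binary outcome reward: 1.0 on an exact final-answer match, else 0.0."""
    return 1.0 if extract_boxed(completion) == gold_answer else 0.0

outcome_reward("... so the sum is \\boxed{204}.", "204")  # -> 1.0
outcome_reward("I am not sure.", "204")                   # -> 0.0
```

Because the signal is a simple match against a trusted answer key, a relatively small curated dataset can still drive strong reasoning gains, which is the data-efficiency point commenters highlight.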

Capabilities: strong at math, weak elsewhere

  • Consensus: this is a specialist math/reasoning model, not a generalist competitor to o1-preview.
  • Users report good performance on non-trivial math puzzles (e.g., sums of cubes, medical board-style questions, jug puzzles), and note that it "overthinks" even trivial prompts like 1+1.
  • At the same time, multiple testers find it "pretty stupid" outside math: it fails ASCII decoding, struggles with basic coding, misremembers algorithms, and behaves like a "high school math homework solver" only.

Quantization, tokenization, and small-model fragility

  • Experiments via Ollama show bizarre behavior on the “count Rs in ‘strawberry’” prompt, with the model hallucinating letter sequences like “strawfurber.”
  • That bug persists even in an FP32 GGUF conversion but disappears at F16/bfloat16; the authors say small models are highly sensitive to quantization and recommend running in bfloat16.
  • Discussion speculates about tokenization issues and hints at possible exploitable quirks.
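The tokenization speculation has a standard intuition behind it: subword tokenizers hand the model multi-character chunks, so it never directly "sees" individual letters. The toy sketch below uses a made-up split of "strawberry" (not DeepScaleR's real tokenizer) purely to illustrate why letter-counting is opaque to the model but trivial in code.

```python
# Toy illustration of why subword tokenization makes letter-counting
# hard for LLMs. The split below is invented for illustration; a real
# BPE tokenizer learns its merges from data.

def toy_bpe_tokenize(word):
    """Hypothetical subword split, loosely mimicking a BPE tokenizer."""
    splits = {"strawberry": ["str", "aw", "berry"]}
    return splits.get(word, [word])

word = "strawberry"
tokens = toy_bpe_tokenize(word)
print(tokens)           # ['str', 'aw', 'berry'] -- what the model consumes
print(word.count("r"))  # 3 -- trivial at the character level
```

The model operates on the token IDs for those chunks, so counting the Rs requires it to have memorized the spelling of each chunk, which small or aggressively quantized models get wrong.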

Benchmarks, overfitting, and trust

  • Some praise: beating o1-preview on math benchmarks with an RL run costing ~$4,500 (claimed ~18× cheaper than DeepSeek-R1's) is seen as nontrivial and exciting for edge devices.
  • Others argue this is likely “overfitting to evals”: fine-tuning narrowly on public math benchmarks says little about general capability.
  • Concerns that AIME and similar benchmarks have leaked online; broader skepticism that static benchmarks are too easy to game.
  • Suggestions include dynamic/parametric benchmarks and more human evals.
  • A later comment claims the competing rStar-Math model is misreported and actually outperforms DeepScaleR on multiple math sets, implying potential errors or cherry-picking.

Specialist vs generalist, tools, and broader implications

  • Several comments foresee many small, specialized models coordinated by a generalist “orchestrator” (Mixture-of-Models).
  • Others argue broad-and-deep generalists remain crucial for creative, cross-domain work.
  • There is interest in combining chain-of-thought models with calculators, code interpreters, and search tools, and in training models to “think by tool calls.”
  • Many see open-source and small RL-tuned models as rapidly advancing, with particular promise for on-device/edge AI.
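The "think by tool calls" idea above can be sketched as a harness that scans the model's chain of thought for a tool marker, runs the tool, and splices the result back in. The `CALC(...)` marker syntax and the harness below are hypothetical, not any specific model's protocol; the calculator walks a restricted AST rather than calling `eval()`.

```python
# Minimal sketch of a tool-call harness: replace every CALC(...) marker
# in model output with the value computed by a safe arithmetic evaluator.
# Marker syntax and function names are illustrative only.
import ast
import operator
import re

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def safe_eval(expr):
    """Evaluate +, -, *, / arithmetic by walking the AST (no eval())."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval").body)

def run_tool_calls(text):
    """Substitute each CALC(...) marker with its computed value."""
    return re.sub(r"CALC\(([^)]*)\)",
                  lambda m: str(safe_eval(m.group(1))), text)

run_tool_calls("The total is CALC(12*12+1).")  # -> 'The total is 145.'
```

Offloading exact arithmetic this way is one route around the weaknesses the thread notes in small chain-of-thought models, without growing the model itself.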