An analysis of DeepSeek's R1-Zero and R1

Performance, cost, and “reasoning” benchmarks

  • o3 greatly outperforms R1 and o1 on ARC‑AGI‑1, but only at extremely high test‑time compute (tens of millions of tokens, ~$3.4k per run in the cited setup).
  • Some see this as evidence of steeply rising marginal cost for each extra “percent of real reasoning.” Others argue the ability to pour in more compute is a feature, not a bug.
  • R1 is praised for cost‑efficiency and “punching above its weight,” and for being a good data generator to distill into smaller models.
  • There is criticism that o3 was fine‑tuned on the ARC‑AGI training set while o1 and R1 were not, making headline comparisons somewhat misleading.

Verifiable rewards, RL, and domain limits

  • “Verifiable reward” is discussed as binary correctness (tests pass, proof checks, answer equals ground truth), loosely analogous to NP verification: checking a proposed solution is cheap even when finding one is hard.
  • This works well for math and code, especially in sandboxed environments with test suites, but breaks down in most real‑world or subjective domains.
  • Even in math/CS, many interesting questions (depth of theorems, usefulness of definitions, quality of models or language designs) lack clear verifiable rewards.
  • Some argue theorem discovery and meaningfulness are partly verifiable; others say “meaningful” can’t be quantified, so RL can’t directly target it.
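The binary notion of verifiable reward in the bullets above can be sketched in a few lines. This is a toy illustration, not any lab's actual reward code; the function names and the sandboxed test harness are assumptions for the sake of the example:

```python
import os
import subprocess
import sys
import tempfile

def math_reward(model_answer: str, ground_truth: str) -> float:
    """Binary reward for math: 1.0 iff the answer matches ground truth exactly."""
    return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0

def code_reward(candidate_source: str, test_source: str) -> float:
    """Binary reward for code: 1.0 iff the test script exits cleanly when run
    against the candidate solution in a throwaway sandbox directory."""
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "solution.py"), "w") as f:
            f.write(candidate_source)
        with open(os.path.join(tmp, "run_tests.py"), "w") as f:
            f.write(test_source)
        result = subprocess.run(
            [sys.executable, "run_tests.py"],
            cwd=tmp, capture_output=True, timeout=30,
        )
        return 1.0 if result.returncode == 0 else 0.0
```

The reward is all-or-nothing and mechanical, which is exactly why it transfers poorly to domains where quality is graded rather than checkable.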

Human bottleneck and training data economics

  • R1‑Zero is framed as “removing the human bottleneck,” but commenters note it still relies on human‑curated pretraining and human/RL signals for non‑verifiable tasks.
  • A proposed flywheel: users pay for inference, their interactions generate labeled data, models improve, attracting more users. Skeptics doubt the novelty and quality of such data.
  • There is active interest in using reasoning models to generate synthetic chains‑of‑thought, then training cheaper base models on this; others worry this amplifies model biases and errors.
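The distillation loop described in the last bullet can be sketched as rejection sampling: draw chains of thought from a teacher model, keep only traces whose final answer verifies, and emit supervised pairs for a cheaper student. The `generate` callable here is a hypothetical stand-in for a reasoning-model call, not a real API:

```python
from typing import Callable

def build_distillation_set(
    problems: list[tuple[str, str]],             # (question, ground_truth) pairs
    generate: Callable[[str], tuple[str, str]],  # stand-in for a teacher model call
    samples_per_problem: int = 4,
) -> list[dict]:
    """Sample (chain_of_thought, final_answer) traces from the teacher and keep
    only those whose final answer verifies against ground truth. The filter
    guarantees correct answers but not unbiased reasoning: the student still
    inherits the teacher's style and systematic errors, the concern raised above."""
    dataset = []
    for question, truth in problems:
        for _ in range(samples_per_problem):
            chain_of_thought, answer = generate(question)
            if answer.strip() == truth.strip():  # verifiable filter (rejection sampling)
                dataset.append({
                    "prompt": question,
                    "completion": f"{chain_of_thought}\nFinal answer: {answer}",
                })
                break                            # one verified trace per problem
    return dataset
```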

User feedback, poisoning, and data quality

  • Corrections like “no, that’s wrong” are seen as valuable RL signal, but models are not currently learning online; updates happen offline and are heavily filtered.
  • Multiple comments discuss adversarial “data poisoning” (fake content, tools aimed at crawlers), and counter‑arguments that large labs can statistically detect and discard much of this, albeit at non‑trivial cost.
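A toy illustration of the “statistically detect and discard” counter-argument: score each crawled document by how far its character-frequency distribution diverges (KL divergence) from a reference corpus, and drop outliers. Real pipelines use far stronger signals (perplexity under a reference model, deduplication, provenance); the character-level statistic and the threshold here are arbitrary assumptions:

```python
import math
from collections import Counter

def char_distribution(text: str) -> dict[str, float]:
    """Normalized character frequencies of a (lowercased) text."""
    counts = Counter(text.lower())
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

def kl_divergence(p: dict[str, float], q: dict[str, float]) -> float:
    """KL(p || q), smoothing characters unseen in the reference q."""
    eps = 1e-6
    return sum(pi * math.log(pi / q.get(c, eps)) for c, pi in p.items())

def filter_poisoned(docs: list[str], reference: str, threshold: float = 1.0) -> list[str]:
    """Keep only documents that look statistically like the reference corpus."""
    ref = char_distribution(reference)
    return [d for d in docs if kl_divergence(char_distribution(d), ref) < threshold]
```

The non-trivial cost mentioned in the thread shows up here too: the defender must maintain a clean reference and tune thresholds, while an attacker only has to craft content that slips under them.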

Future of coding and bespoke software

  • One camp envisions LLMs building full apps end‑to‑end (spec, code, tests, deployment), enabling “bespoke software for everyone.”
  • Others argue requirements elicitation, security, billing, oversight, and multi‑user value are the real hard parts; current agentic tools loop, waste tokens, and produce brittle code.
  • Some expect near‑term improvement (LLM as competent dev team), others think this will “almost certainly never materialize,” at least in the strong version.

Inference compute and Nvidia

  • A shift of spend from training to inference, together with expensive reasoning tokens, is expected to increase total compute demand.
  • For inference, Nvidia faces more competition (TPUs, Groq, Cerebras, AMD, on‑device), and several people report successful migration away from CUDA for serving.
  • Others insist the CUDA/software stack remains a deep moat for high‑end, fast‑moving workloads, especially in training; inference is the easiest layer to peel away.