The Illustrated DeepSeek-R1
Author and Foundations vs. Fast-Changing Details
- Commenters praise the “Illustrated …” series as high quality and view the author’s name as a reliability signal.
- There’s skepticism that an LLM book can stay current, countered by the argument that core foundations (gradient descent, tokenization, embeddings, self-attention, MLPs, SFT, RLHF) change slowly even if products evolve quickly.
- Minor debate over mentioning “gradient descent” but not “Transformer encoders”; clarification that modern top LLMs are decoder‑only, with self‑attention + MLP blocks as the core, while encoder models remain useful for tasks like embeddings and classification.
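To make the “self‑attention + MLP as the core” point concrete, here is a minimal numpy sketch of a single decoder‑only block: causal self‑attention with a residual connection, followed by an MLP. All dimensions and weights are illustrative toy values, not any real model’s configuration, and the ReLU MLP stands in for the gated variants modern LLMs actually use.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_tokens = 16, 4  # toy sizes for illustration

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(x, Wq, Wk, Wv):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Causal mask: token i may only attend to tokens j <= i.
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)
    scores[mask] = -np.inf
    return softmax(scores) @ v

def mlp(x, W1, W2):
    # Simple ReLU MLP; real LLMs typically use gated activations (e.g. SwiGLU).
    return np.maximum(0, x @ W1) @ W2

x = rng.normal(size=(n_tokens, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))
W1 = rng.normal(size=(d_model, 4 * d_model)) * 0.1
W2 = rng.normal(size=(4 * d_model, d_model)) * 0.1

h = x + causal_self_attention(x, Wq, Wk, Wv)  # residual connection
out = h + mlp(h, W1, W2)                       # residual connection
```

Because of the causal mask, perturbing a later token never changes the outputs for earlier positions, which is what makes next‑token training and autoregressive decoding consistent.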
DeepSeek Training Data, Distillation, and “I’m GPT‑4” Confusion
- Several people don’t understand how DeepSeek trained so cheaply, or what the 14.8T tokens actually were, given the vague corpus description.
- Some suspect heavy use of GPT‑4 outputs or outright distillation; others counter that:
- Distillation via OpenAI API would not be cheaper (you still pay for compute and API).
- Lack of token‑probability (logit) outputs from the OpenAI API makes true distillation harder; more likely they fine‑tuned on public GPT‑4‑style datasets and on GPT‑4 outputs that have “radiated” across the open web.
- Multiple users note many models say they’re ChatGPT/OpenAI due to training on web text where those terms dominate, not because they literally are those models.
Synthetic Data, Reasoning, and Creativity Evaluation
- Commenters highlight that large‑scale synthetic chain‑of‑thought data (hundreds of thousands of long CoT traces) is novel and expensive, and that CoT is explicitly part of training, not just an inference trick.
- There’s excitement that verifiable reasoning tasks look “solvable” via synthetic data; concern shifts to qualitative/creative domains.
- Debate on whether creativity can be evaluated:
- One side: art is irreducibly subjective and culture‑dependent.
- Other side: you can model raters + artworks, decompose images into features (composition, rule of thirds, familiarity vs. surprise), and learn predictive “creativity/appeal” scores.
- Distinction is drawn between combinatorial “card‑shuffling” creativity vs. inventing truly new concepts; some think the latter might emerge from noisy latent reasoning.
Technical Innovations Behind DeepSeek V3/R1
- Summary of claimed improvements:
- Multi‑Head Latent Attention (MLA): low‑rank compression of the KV cache, trading extra compute for much lower memory use with small accuracy loss.
- MoE with one shared expert + many small routed experts, with only a subset active per token; plus improved load balancing via raising the routing bias of underused experts.
- Multi‑token prediction: extra heads trained to predict several future tokens at once, thought to densify the training signal and improve sequence modeling.
- FP8 where possible and extensive infrastructure work (e.g., DualPipe, efficient all‑to‑all) to overcome bandwidth limits on H800 GPUs.
- Commenters are unsure which change drives the reported ~10× efficiency; many suspect multi‑token prediction + infrastructure as main contributors and note some skepticism versus the marketing claims.
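The MoE routing described above can be sketched in a few lines of numpy: one shared expert that is always active, top‑k selection among routed experts, and a per‑expert bias that is nudged upward for underused experts between batches. The expert sizes, the bias‑update rule, and the use of biased logits for gate weights are all illustrative assumptions, not DeepSeek’s exact recipe.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_experts, top_k = 8, 4, 2  # toy sizes for illustration

W_shared = rng.normal(size=(d, d)) * 0.1            # shared expert (always on)
W_routed = rng.normal(size=(n_experts, d, d)) * 0.1  # routed experts
W_gate = rng.normal(size=(d, n_experts)) * 0.1       # router
bias = np.zeros(n_experts)  # load-balancing bias, updated between batches

def moe_forward(x):
    logits = x @ W_gate + bias                 # bias steers expert *selection*
    top = np.argsort(logits)[::-1][:top_k]     # pick top-k experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                       # normalized gate weights
    y = x @ W_shared                           # shared expert always contributes
    for g, i in zip(gates, top):
        y = y + g * (x @ W_routed[i])          # only top-k routed experts run
    return y, top

# Simulate a batch, count expert usage, then raise the bias of
# underused experts so future tokens are routed to them more often.
counts = np.zeros(n_experts)
for _ in range(32):
    _, chosen = moe_forward(rng.normal(size=d))
    counts[chosen] += 1
bias += 0.01 * (counts.mean() - counts)  # underused -> bias up, overused -> down
```

The appeal of this bias‑based balancing is that it avoids an auxiliary load‑balancing loss term: routing pressure is applied directly to the selection logits without distorting the training objective.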
R1 Capabilities: Impressed vs. Underwhelmed
- Some users find R1 or its distilled variants dramatically better than prior open models for coding, especially when run locally, and see RL‑style reasoning (search over action sequences + reward) as a fundamental qualitative jump over pure RLHF “vibe checks.”
- Others report R1 performing notably worse than o1/o1‑pro on complex real‑world coding and scientific tasks, with visible hallucinations and reasoning loops; they point to R1’s own paper admitting only modest gains over V3 on software engineering tasks.
- Consensus: R1 is a significant research and engineering milestone, but its practical reasoning quality vs. top proprietary models is contested and task‑dependent.
Cost, Openness, and Geopolitical Shock
- Many see the real story as:
- A Chinese team, constrained by export controls and weaker hardware, achieved near‑frontier benchmarks at dramatically lower reported training cost.
- Open weights and unusually detailed training disclosures reduce the mystique around closed‑model “secret sauce” and may weaken Nvidia’s and OpenAI’s perceived moats.
- Some emphasize that DeepSeek is not a casual side project but a serious spin‑out repurposing hedge‑fund compute and talent.
Censorship, Alignment, and Political Influence
- Long subthread debates using a Chinese model vs. Western ones:
- One concern: large‑scale deployment of a censored/“approved” model could shape homework help, political recommendations, and global narratives.
- Others respond that Western models are also heavily “aligned” (censored) and trained on dubiously obtained data; “nobody is innocent,” so cost and openness matter more.
- Some propose using a panel of diverse models to balance national and ideological biases.
- Several note that R1 itself exhibits both censorship and alignment; diversity of models is seen as a strength.
HN Meta and Presentation Style
- Users are puzzled that such a “high signal” post fell off the front page quickly while lower‑point posts linger; algorithm behavior remains opaque.
- One commenter criticizes the “illustrated” style as mostly text‑in‑boxes rather than deeply visual explanations, suggesting inspiration from more sophisticated visualization thinkers.