Run DeepSeek R1 Dynamic 1.58-bit
Model design, scaling, and training approaches
- Some hope future base models will target 128GB-class consumer hardware, e.g. MoE with ~16B active params, leveraging heavy quantization and strong routing.
- Commenters note DeepSeek already uses multi-stage training where smaller reasoning models generate synthetic data for larger ones; this is compared conceptually to “dreaming”.
- Discussion on FP8/INT8 training: DeepSeek’s large‑scale FP8 training without loss spikes is seen as technically notable.
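As a rough illustration of what an FP8-style cast involves, the sketch below does per-tensor dynamic scaling into the E4M3 range (whose largest finite value is 448) and rounds to a 3-bit-mantissa grid to mimic 8-bit float precision loss. The helper names and the grid-rounding shortcut are illustrative simplifications, not DeepSeek's actual kernels, which use hardware casts and more elaborate per-block scaling.

```python
import math

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def quantize_fp8_e4m3(values):
    """Per-tensor dynamic scaling for an FP8-style cast: map the tensor's
    absmax onto the E4M3 range, then round each value to a 3-bit-mantissa
    grid to mimic the precision loss of 8-bit floats."""
    amax = max(abs(v) for v in values) or 1.0
    scale = E4M3_MAX / amax
    out = []
    for v in values:
        x = v * scale
        if x != 0.0:
            exponent = math.floor(math.log2(abs(x)))
            step = 2.0 ** (exponent - 3)  # 3 mantissa bits -> grid of 2^(e-3)
            x = round(x / step) * step
        out.append(x)
    return out, scale

def dequantize(values, scale):
    return [v / scale for v in values]
```

With only 3 mantissa bits the relative round-trip error stays within a few percent per value; the hard part of FP8 training is keeping those per-tensor scales well-chosen across steps and gradients so errors never compound into loss spikes.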
1.58‑bit / dynamic quantization findings
- Naive uniform 1.58‑bit quantization leads to “fried” models: infinite repetition, forgetting context, and general nonsense.
- Several argue repetition penalties or advanced samplers (DRY, min_p, temperature tweaks) can mitigate symptoms but cannot restore true accuracy if probabilities are too distorted.
- The “dynamic” scheme—keeping sensitive components (e.g. attention, some projections) at higher precision and applying 1.58‑bit mainly to MoE experts—largely removes the repetition problem while delivering ~80% size reduction.
- Debate on how far such extreme quantization can go before it’s better to use a smaller but higher‑precision model.
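For context, 1.58 bits is log2(3): each weight takes one of the three values {-1, 0, +1}. The sketch below shows absmean ternary quantization (the BitNet b1.58 recipe) plus the "dynamic" idea of applying it only to expert weights; the layer names and the `"fp16"` stand-in are purely illustrative, not the actual released format.

```python
def ternary_quantize(weights):
    """Absmean ternary quantization (the BitNet b1.58 recipe): scale by
    the mean absolute value, then round each weight to -1, 0, or +1."""
    scale = sum(abs(w) for w in weights) / len(weights) or 1.0
    q = [max(-1, min(1, round(w / scale))) for w in weights]
    return q, scale

def ternary_dequantize(q, scale):
    return [x * scale for x in q]

def dynamic_quantize(layers):
    """Sketch of the 'dynamic' scheme: push 1.58-bit onto MoE expert
    weights only; keep sensitive layers (attention, routers) at higher
    precision. Layer names here are purely illustrative."""
    packed = {}
    for name, weights in layers.items():
        if "expert" in name:
            packed[name] = ("ternary", *ternary_quantize(weights))
        else:
            packed[name] = ("fp16", weights)  # stand-in for "higher precision"
    return packed
```

Since the experts hold the vast majority of a large MoE's parameters, ternarizing them alone is enough to reach roughly the ~80% size reduction reported, while the full-precision attention path keeps output distributions from collapsing into repetition.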
Running huge MoE models: hardware and parallelism
- MoE is described as memory‑bound: only a fraction of experts are active per token (e.g. 8/256), but routing incurs heavy all‑to‑all GPU communication.
- Inference strategies discussed: pipeline parallelism (layer‑wise sharding), tensor parallelism, and combinations thereof.
- Many compare options: multi‑3090 rigs vs 192GB Mac Ultra vs upcoming AMD APUs and Nvidia “Digits”; trade‑offs revolve around VRAM+RAM, bandwidth, power, and portability.
- CPU‑only builds (EPYC/Threadripper) are seen as workable but slow; memory bandwidth, not capacity, is usually the real bottleneck.
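The memory-bound point above can be made concrete with a toy top-k router: with 256 experts and k=8, ~97% of expert weights stay cold for any given token. The softmax gating and function names below are a generic MoE sketch, not DeepSeek's actual router.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, k=8):
    """Toy top-k MoE gating: keep the k highest-probability experts for
    this token and renormalize their gate weights to sum to 1."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    gate_sum = sum(probs[i] for i in top)
    return {i: probs[i] / gate_sum for i in top}  # expert index -> gate weight

# One token's routing over 256 experts: only 8 experts do any work.
random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(256)]
gates = route_token(logits, k=8)
```

In a real multi-GPU deployment the chosen experts live on different devices, so each token's activations must be shuffled to them and back — the all-to-all traffic mentioned above.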
Practical usability and benchmarks
- 1.58‑bit R1 reportedly reaches ~140 tok/s on dual H100s; some users get a few tok/s on multi‑GPU consumer rigs—usable but not snappy.
- Several ask for standard benchmarks; without direct evals against full‑precision R1, how "lobotomized" the quantized model is remains unclear.
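These throughput reports can be sanity-checked with back-of-envelope bandwidth arithmetic: a memory-bound decoder must stream every active weight once per generated token, so bandwidth divided by active bytes gives an optimistic ceiling. The numbers below (~37B active params per token for R1, ~1.6 bits/weight average, ~90 GB/s for dual-channel DDR5) are rough assumptions, not measurements.

```python
def tokens_per_second(bandwidth_gb_s, active_params_billion, bits_per_weight):
    """Optimistic decode ceiling for a memory-bound model: every generated
    token streams each active weight from memory exactly once."""
    active_bytes = active_params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / active_bytes

# Assumed, rough numbers: R1 activates ~37B of its 671B params per token;
# the dynamic quant averages ~1.6 bits/weight; desktop DDR5 gives ~90 GB/s.
ceiling = tokens_per_second(90, 37, 1.6)
print(f"~{ceiling:.0f} tok/s ceiling on desktop DDR5")
```

That ceiling lands around 12 tok/s for desktop-class bandwidth; real rigs come in well under it once routing, activations, and KV-cache traffic are counted, which is consistent with the "few tok/s" reports above.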
Ecosystem, OpenAI, and market impact
- Strong disagreement on claims that DeepSeek "kills" OpenAI: some predict OpenAI's decline; others argue large labs will adopt the same efficiency tricks and that cheaper inference will expand total demand (Jevons paradox), letting them still win via scale.
- Many stress DeepSeek’s significance is cost/efficiency and the open training recipe, not just the released weights.
- Ongoing concerns about censorship (e.g. on politically sensitive topics) and about calling “open weights” models truly “open source.”
Distills, Ollama, and local use
- Distilled R1 variants (Qwen/Llama 7–70B) are widely used locally but are consistently reported as weaker and less knowledgeable than full R1, merely imitating its reasoning style.
- Some accuse tools (Ollama's model naming in particular) of marketing distills as "R1" and confusing non‑experts.
- For many real workloads (RAG, narrow tasks), moderate‑size quantized models remain sufficient and cheaper to run.