Run DeepSeek R1 Dynamic 1.58-bit

Model design, scaling, and training approaches

  • Some hope future base models will target 128GB-class consumer hardware, e.g. MoE with ~16B active params, leveraging heavy quantization and strong routing.
  • Commenters note DeepSeek already uses multi-stage training where smaller reasoning models generate synthetic data for larger ones; this is compared conceptually to “dreaming”.
  • Discussion on FP8/INT8 training: DeepSeek’s large‑scale FP8 training without loss spikes is seen as technically notable.

1.58‑bit / dynamic quantization findings

  • Naive uniform 1.58‑bit quantization of every layer produces “fried” models: endless repetition loops, forgotten context, and general nonsense.
  • Several argue repetition penalties or advanced samplers (DRY, min_p, temperature tweaks) can mask the symptoms but cannot restore accuracy once the output distribution is badly distorted.
  • The “dynamic” scheme—keeping sensitive components (e.g. attention, some projections) at higher precision and applying 1.58‑bit mainly to MoE experts—largely removes the repetition problem while delivering ~80% size reduction.
  • Debate on how far such extreme quantization can go before it’s better to use a smaller but higher‑precision model.
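The selective scheme above can be sketched in a few lines. This is an illustrative toy, not the actual Unsloth/llama.cpp implementation: it uses BitNet-b1.58-style absmean ternary rounding, and the `quantize_model` helper and the `keep_fp16` name matching are assumptions made up for the example.

```python
import numpy as np

def ternary_quantize(w, eps=1e-8):
    """Absmean 1.58-bit quantization: map weights to {-1, 0, +1} times one scale."""
    scale = np.abs(w).mean() + eps
    q = np.clip(np.round(w / scale), -1, 1)  # ternary codes
    return q, scale

def quantize_model(layers, keep_fp16=("attn", "router")):
    """'Dynamic' scheme: 1.58-bit only for expert weights, higher precision for
    sensitive components (attention, routing). Name matching is illustrative."""
    out = {}
    for name, w in layers.items():
        if any(k in name for k in keep_fp16):
            out[name] = ("fp16", w)                    # sensitive: left alone
        else:
            out[name] = ("ternary", ternary_quantize(w))
    return out

rng = np.random.default_rng(0)
layers = {"attn.q_proj": rng.normal(size=(8, 8)),
          "expert.0.w1": rng.normal(size=(8, 8))}
qmodel = quantize_model(layers)
print(qmodel["attn.q_proj"][0])  # 'fp16'
print(qmodel["expert.0.w1"][0])  # 'ternary'
```

Since MoE experts hold the vast majority of parameters, pushing only them to ~1.58 bits is where the ~80% size reduction comes from.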

Running huge MoE models: hardware and parallelism

  • MoE inference is described as memory‑bound: only a small fraction of experts is active per token (e.g. 8 of 256), but routing incurs heavy all‑to‑all communication across GPUs.
  • Inference strategies discussed: pipeline parallelism (layer‑wise sharding), tensor parallelism, and combinations thereof.
  • Many compare options: multi‑3090 rigs vs 192GB Mac Ultra vs upcoming AMD APUs and Nvidia “Digits”; trade‑offs revolve around VRAM+RAM, bandwidth, power, and portability.
  • CPU‑only (EPYC/Threadripper) is seen as workable but slow; bandwidth, not capacity, is usually the main bottleneck.
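The “only a few experts per token” point is easy to see in code. A minimal single-device sketch of top-k MoE routing, assuming a plain softmax gate (the `moe_forward` name and shapes are made up for illustration; real deployments shard the experts, which is what creates the all-to-all traffic):

```python
import numpy as np

def moe_forward(x, router_w, experts, k=2):
    """Top-k MoE layer: each token reads only k of len(experts) expert weight
    matrices, so per-token memory traffic scales with active, not total, params."""
    logits = x @ router_w                        # (tokens, n_experts)
    topk = np.argsort(logits, axis=-1)[:, -k:]   # k best experts per token
    sel = np.take_along_axis(logits, topk, axis=-1)
    gates = np.exp(sel - sel.max(-1, keepdims=True))
    gates /= gates.sum(-1, keepdims=True)        # softmax over selected experts only
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for j in range(k):
            e = topk[t, j]                       # only these experts are touched
            out[t] += gates[t, j] * (x[t] @ experts[e])
    return out, topk

rng = np.random.default_rng(1)
d, n_experts, tokens = 16, 8, 4
x = rng.normal(size=(tokens, d))
router_w = rng.normal(size=(d, n_experts))
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
y, used = moe_forward(x, router_w, experts, k=2)
print(y.shape, used.shape)  # (4, 16) (4, 2)
```

With experts sharded across devices, the `topk` indices decide which device each token must visit, which is why routing turns into all-to-all communication.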

Practical usability and benchmarks

  • 1.58‑bit R1 reportedly reaches ~140 tok/s on dual H100s; some users get a few tok/s on multi‑GPU consumer rigs—usable but not snappy.
  • Several ask for standard benchmarks; the lack of direct evals against full‑precision R1 leaves “how lobotomized is it?” an open question.
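The reported speeds are consistent with a simple bandwidth-bound estimate: each generated token must stream roughly all active weights from memory once, so tokens/s is bounded by bandwidth divided by bytes per token. A back-of-envelope sketch (the function and the hardware numbers are illustrative assumptions, not measurements; it ignores KV-cache reads and compute):

```python
def tokens_per_sec(bandwidth_gbs, active_params_b, bits_per_param):
    """Rough upper bound for a memory-bandwidth-bound decoder:
    tok/s <= bandwidth / (active params * bytes per param)."""
    bytes_per_token = active_params_b * 1e9 * bits_per_param / 8
    return bandwidth_gbs * 1e9 / bytes_per_token

# Illustrative: ~37B active params (R1's MoE) at ~1.58 bits per weight.
fast = tokens_per_sec(800, 37, 1.58)   # ~800 GB/s, Mac-Ultra-class memory
slow = tokens_per_sec(200, 37, 1.58)   # ~200 GB/s, multi-channel server DRAM
print(round(fast, 1), round(slow, 1))
```

This is why the thread keeps returning to bandwidth rather than capacity: once the model fits at all, the memory bus sets the ceiling on generation speed.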

Ecosystem, OpenAI, and market impact

  • Strong disagreement on claims that DeepSeek “kills” OpenAI: some predict OpenAI’s decline; others argue large labs will adopt similar efficiency tricks and still win via scale (Jevons paradox).
  • Many stress DeepSeek’s significance is cost/efficiency and the open training recipe, not just the released weights.
  • Ongoing concerns about censorship (e.g. on politically sensitive topics) and about calling “open weights” models truly “open source.”

Distills, Ollama, and local use

  • Distilled R1 variants (Qwen/Llama 7–70B) are widely used locally but are consistently reported as weaker and less knowledgeable than full R1, merely imitating its reasoning style.
  • Some accuse tooling (e.g. Ollama’s model naming) of marketing distills as “R1”, confusing non‑experts.
  • For many real workloads (RAG, narrow tasks), moderate‑size quantized models remain sufficient and cheaper to run.