Run DeepSeek R1 Dynamic 1.58-bit

Model design, scaling, and training approaches

  • Some hope future base models will target 128GB-class consumer hardware, e.g. MoE with ~16B active params, leveraging heavy quantization and strong routing.
  • Commenters note DeepSeek already uses multi-stage training where smaller reasoning models generate synthetic data for larger ones; this is compared conceptually to “dreaming”.
  • Discussion on FP8/INT8 training: DeepSeek’s large‑scale FP8 training without loss spikes is seen as technically notable.

1.58‑bit / dynamic quantization findings

  • Naive uniform 1.58‑bit quantization of every layer produces “fried” models: endless repetition loops, forgotten context, and general nonsense.
  • Several argue repetition penalties or advanced samplers (DRY, min_p, temperature tweaks) can mask the symptoms but cannot restore accuracy once the output distribution is badly distorted.
  • The “dynamic” scheme—keeping sensitive components (e.g. attention, some projections) at higher precision and applying 1.58‑bit mainly to MoE experts—largely removes the repetition problem while delivering ~80% size reduction.
  • Debate on how far such extreme quantization can go before it’s better to use a smaller but higher‑precision model.
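The selective scheme above can be sketched in a few lines. This is an illustrative toy, not the actual Unsloth/llama.cpp implementation: it uses BitNet-b1.58-style absmean ternary rounding, and the `quantize_model` helper and the `keep_fp16` name matching are assumptions made up for the example.

```python
import numpy as np

def ternary_quantize(w, eps=1e-8):
    """Absmean 1.58-bit quantization: map weights to {-1, 0, +1} times one scale."""
    scale = np.abs(w).mean() + eps
    q = np.clip(np.round(w / scale), -1, 1)  # ternary codes
    return q, scale

def quantize_model(layers, keep_fp16=("attn", "router")):
    """'Dynamic' scheme: 1.58-bit only for expert weights, higher precision for
    sensitive components (attention, routing). Name matching is illustrative."""
    out = {}
    for name, w in layers.items():
        if any(k in name for k in keep_fp16):
            out[name] = ("fp16", w)                    # sensitive: left alone
        else:
            out[name] = ("ternary", ternary_quantize(w))
    return out

rng = np.random.default_rng(0)
layers = {"attn.q_proj": rng.normal(size=(8, 8)),
          "expert.0.w1": rng.normal(size=(8, 8))}
qmodel = quantize_model(layers)
print(qmodel["attn.q_proj"][0])  # 'fp16'
print(qmodel["expert.0.w1"][0])  # 'ternary'
```

Since MoE experts hold the vast majority of parameters, pushing only them to ~1.58 bits is where the ~80% size reduction comes from.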

Running huge MoE models: hardware and parallelism

  • MoE inference is described as memory‑bound: only a small fraction of experts is active per token (e.g. 8 of 256), but routing incurs heavy all‑to‑all communication across GPUs.
  • Inference strategies discussed: pipeline parallelism (layer‑wise sharding), tensor parallelism, and combinations thereof.
  • Many compare options: multi‑3090 rigs vs 192GB Mac Ultra vs upcoming AMD APUs and Nvidia “Digits”; trade‑offs revolve around VRAM+RAM, bandwidth, power, and portability.
  • CPU‑only (EPYC/Threadripper) is seen as workable but slow; bandwidth, not capacity, is usually the main bottleneck.
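The “only a few experts per token” point is easy to see in code. A minimal single-device sketch of top-k MoE routing, assuming a plain softmax gate (the `moe_forward` name and shapes are made up for illustration; real deployments shard the experts, which is what creates the all-to-all traffic):

```python
import numpy as np

def moe_forward(x, router_w, experts, k=2):
    """Top-k MoE layer: each token reads only k of len(experts) expert weight
    matrices, so per-token memory traffic scales with active, not total, params."""
    logits = x @ router_w                        # (tokens, n_experts)
    topk = np.argsort(logits, axis=-1)[:, -k:]   # k best experts per token
    sel = np.take_along_axis(logits, topk, axis=-1)
    gates = np.exp(sel - sel.max(-1, keepdims=True))
    gates /= gates.sum(-1, keepdims=True)        # softmax over selected experts only
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for j in range(k):
            e = topk[t, j]                       # only these experts are touched
            out[t] += gates[t, j] * (x[t] @ experts[e])
    return out, topk

rng = np.random.default_rng(1)
d, n_experts, tokens = 16, 8, 4
x = rng.normal(size=(tokens, d))
router_w = rng.normal(size=(d, n_experts))
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
y, used = moe_forward(x, router_w, experts, k=2)
print(y.shape, used.shape)  # (4, 16) (4, 2)
```

With experts sharded across devices, the `topk` indices decide which device each token must visit, which is why routing turns into all-to-all communication.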

Practical usability and benchmarks

  • 1.58‑bit R1 reportedly reaches ~140 tok/s on dual H100s; some users get a few tok/s on multi‑GPU consumer rigs—usable but not snappy.
  • Several ask for standard benchmarks; the lack of direct evals against full‑precision R1 leaves “how lobotomized is it?” an open question.
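The reported speeds are consistent with a simple bandwidth-bound estimate: each generated token must stream roughly all active weights from memory once, so tokens/s is bounded by bandwidth divided by bytes per token. A back-of-envelope sketch (the function and the hardware numbers are illustrative assumptions, not measurements; it ignores KV-cache reads and compute):

```python
def tokens_per_sec(bandwidth_gbs, active_params_b, bits_per_param):
    """Rough upper bound for a memory-bandwidth-bound decoder:
    tok/s <= bandwidth / (active params * bytes per param)."""
    bytes_per_token = active_params_b * 1e9 * bits_per_param / 8
    return bandwidth_gbs * 1e9 / bytes_per_token

# Illustrative: ~37B active params (R1's MoE) at ~1.58 bits per weight.
fast = tokens_per_sec(800, 37, 1.58)   # ~800 GB/s, Mac-Ultra-class memory
slow = tokens_per_sec(200, 37, 1.58)   # ~200 GB/s, multi-channel server DRAM
print(round(fast, 1), round(slow, 1))
```

This is why the thread keeps returning to bandwidth rather than capacity: once the model fits at all, the memory bus sets the ceiling on generation speed.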

Ecosystem, OpenAI, and market impact

  • Strong disagreement on claims that DeepSeek “kills” OpenAI: some predict OpenAI’s decline; others argue large labs will adopt similar efficiency tricks and still win via scale (Jevons paradox).
  • Many stress DeepSeek’s significance is cost/efficiency and the open training recipe, not just the released weights.
  • Ongoing concerns about censorship (e.g. on politically sensitive topics) and about calling “open weights” models truly “open source.”

Distills, Ollama, and local use

  • Distilled R1 variants (Qwen/Llama 7–70B) are widely used locally but are consistently reported as weaker and less knowledgeable than full R1, merely imitating its reasoning style.
  • Some accuse tooling (e.g. Ollama’s model naming) of marketing distills as “R1”, confusing non‑experts.
  • For many real workloads (RAG, narrow tasks), moderate‑size quantized models remain sufficient and cheaper to run.