DeepSeek-R1-671B-Q4_K_M with 1 or 2 Arc A770 on Xeon
Headline vs. Reality
- Multiple commenters say the title is misleading: you are not truly “running DeepSeek-R1-671B on 1–2 A770s” in VRAM.
- The guide’s early example is a 7B distilled model, but deeper in it discusses the full DeepSeek-R1 Q4_K_M with FlashMoE.
- The actual recipe: ~380 GB of system RAM + 1–8 Intel Arc A770 GPUs + 500 GB disk. Most of the model resides in CPU memory; GPUs offload a smaller portion.
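As a back-of-the-envelope check on the ~380 GB RAM figure: a Q4_K_M quant averages roughly 4.8–4.9 bits per weight (an assumed average; the real GGUF mixes several tensor types):

```python
# Rough memory estimate for a 671B-parameter model at Q4_K_M.
# 4.85 bits/weight is an assumed average, not an exact GGUF figure.
params = 671e9
bits_per_weight = 4.85
model_bytes = params * bits_per_weight / 8
print(f"~{model_bytes / 2**30:.0f} GiB")  # prints "~379 GiB"
```

This lines up with the quoted requirement: the weights alone nearly fill 384 GB, before KV cache and runtime overhead.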
MoE Architecture & Active Parameters
- DeepSeek V3/R1 is a sparse MoE: of the 256 routed experts per layer, K=8 are selected per token, plus 1 shared expert that is always active.
- “37B” refers to the active parameters per token (selected experts plus shared layers and routing overhead), not the size of a single expert.
- If the same experts are selected for consecutive tokens, later tokens can reuse experts in VRAM and behave more like a 37B model in practice; if experts change frequently, CPU–GPU transfers dominate.
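The routing described above can be sketched as a toy top-K selection (shapes and the plain softmax gate are simplifications; real DeepSeek routing adds bias terms and gate normalization):

```python
import math
import random

NUM_EXPERTS, TOP_K = 256, 8  # per-layer routed experts, active per token

def route(router_logits):
    """Pick the TOP_K routed experts for one token from router logits.

    The always-active shared expert is omitted here for simplicity.
    """
    # Numerically stable softmax over router logits -> per-expert gate scores.
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]
    total = sum(exps)
    scores = [e / total for e in exps]
    # Indices of the K highest-scoring experts, highest first.
    topk = sorted(range(NUM_EXPERTS), key=lambda i: scores[i], reverse=True)[:TOP_K]
    return topk, [scores[i] for i in topk]

random.seed(0)
logits = [random.gauss(0, 1) for _ in range(NUM_EXPERTS)]
experts, gates = route(logits)
print(len(experts))  # prints 8: only 8 of 256 routed experts run for this token
```

Whether consecutive tokens reuse the same `experts` list is exactly what decides if the cached-in-VRAM experts help or if fresh CPU-to-GPU transfers dominate.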
Implementation Details (llama.cpp / ipex-llm)
- This builds on llama.cpp with Intel’s ipex-llm extensions for hybrid CPU–GPU MoE.
- One commenter notes llama.cpp traditionally splits layers between CPU and GPU, so GPU speedups are gated by CPU layers; Intel says they add extra MoE-specific optimizations.
- With a single A770, context length appears limited (~1024 tokens); more GPUs may allow longer context.
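One way to see why GPU speedups are gated by CPU-resident layers is a serial per-token latency model, in the spirit of Amdahl's law (all timings below are made-up illustrations, not measurements):

```python
# Toy per-token latency model for a layer-split hybrid CPU/GPU setup.
# t_cpu_layer and t_gpu_layer are illustrative numbers, not benchmarks.
def tokens_per_sec(n_layers, gpu_layers, t_cpu_layer=0.020, t_gpu_layer=0.002):
    """Decode throughput when layers run serially, split across CPU and GPU."""
    cpu_layers = n_layers - gpu_layers
    t_token = cpu_layers * t_cpu_layer + gpu_layers * t_gpu_layer
    return 1.0 / t_token

# DeepSeek V3/R1 has 61 transformer layers.
for g in (0, 10, 30, 61):
    print(g, round(tokens_per_sec(61, g), 1))
```

Under these assumed timings, offloading a minority of layers barely moves throughput; it only approaches GPU-class speed when nearly everything fits on the GPU, which is why MoE-specific tricks (offloading only the hot experts) matter.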
Performance & Benchmarks
- Official throughput numbers for DeepSeek-R1 in this setup are sparse; the only figure cited is a claim of “>8 tokens/s” on a dual-socket 5th-gen Xeon.
- People criticize other large-CPU rigs (e.g., dual Epyc) that get 3–4 tok/s on reasoning models as effectively unusable for long think phases.
- Others argue even slow local setups are valuable for development or for those prioritizing locality over speed.
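To make the usability complaint concrete, here is the wall-clock time for a long reasoning ("think") phase at the quoted rates; the 5,000-token think-phase length is an illustrative assumption:

```python
def think_time_minutes(think_tokens, tok_per_sec):
    """Minutes spent waiting for a reasoning phase of the given length."""
    return think_tokens / tok_per_sec / 60

# 3-4 tok/s (criticized dual-Epyc rigs) vs. the claimed >8 tok/s on Xeon.
for rate in (3, 8):
    print(rate, round(think_time_minutes(5000, rate), 1))
# prints: 3 27.8  /  8 10.4
```

At 3 tok/s a single long think phase is close to half an hour, which is why commenters call such rigs effectively unusable for interactive reasoning work.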
Hardware, Cost, and Alternatives
- Xeon is favored for its many memory channels and PCIe lanes; consumer platforms typically top out at 128–256 GB of RAM.
- Some argue 384 GB DDR4 is now relatively cheap; others note Intel likely used high-end DDR5 Xeons to hit reported speeds.
- Debate over multi-GPU vs buying a single large-VRAM accelerator; multi-GPU is often tricky and not always faster.
- Quantization tradeoffs: Q4 is seen as weak for coding and Q5 as a sweet spot, yet very low-bit (~Q2) DeepSeek-R1/V2.5 quants can still be surprisingly capable, especially outside creative writing.
- Ecosystem sentiment: for serious work, Nvidia is still recommended; Intel Arc and AMD are improving but lag in software support. Some speculate APUs with large unified memory may shift this landscape.
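For the quantization tradeoff above, approximate file sizes at different bit widths show why very low-bit quants are attractive on RAM-limited rigs (the bits-per-weight averages are assumptions; real GGUF quants mix tensor types):

```python
PARAMS = 671e9  # DeepSeek-R1 total parameters

# Assumed average bits/weight per quant level; actual GGUF mixes vary.
QUANTS = {"Q2_K": 2.6, "Q4_K_M": 4.85, "Q5_K_M": 5.7}

for name, bpw in QUANTS.items():
    gib = PARAMS * bpw / 8 / 2**30
    print(f"{name}: ~{gib:.0f} GiB")
```

Under these assumptions a ~Q2 quant comes in around 200 GiB, small enough for a 256 GB box, while the Q5 "sweet spot" pushes well past 384 GB of DDR4 once KV cache is added.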