DeepSeek-R1-671B-Q4_K_M with 1 or 2 Arc A770 on Xeon

Headline vs. Reality

  • Multiple commenters say the title is misleading: you are not truly “running DeepSeek-R1-671B on 1–2 A770s” in VRAM.
  • The guide’s early example is a 7B distilled model, but deeper in it discusses the full DeepSeek-R1 Q4_K_M with FlashMoE.
  • The actual recipe: ~380 GB of system RAM + 1–8 Intel Arc A770 GPUs + 500 GB of disk. Most of the model resides in CPU memory; the GPUs hold only a smaller offloaded portion.
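A back-of-envelope check shows why the recipe above needs that much RAM. This sketch assumes Q4_K_M averages roughly 4.8 bits per weight (an approximation; real GGUF files vary slightly) and a single 16 GB A770:

```python
# Rough memory footprint for DeepSeek-R1 671B at Q4_K_M.
# ASSUMPTION: Q4_K_M averages ~4.8 bits/weight (mixed 4/6-bit blocks).
TOTAL_PARAMS = 671e9
BITS_PER_WEIGHT = 4.8

model_gb = TOTAL_PARAMS * BITS_PER_WEIGHT / 8 / 1e9
print(f"Quantized model size: ~{model_gb:.0f} GB")

# With one 16 GB A770 offloading part of the weights, the remainder
# must sit in system RAM (plus KV cache and runtime overhead):
gpu_vram_gb = 16
print(f"Resident in system RAM: ~{model_gb - gpu_vram_gb:.0f} GB")
```

The result lands near 400 GB on disk and a bit under that in RAM, which matches the ~380 GB RAM / 500 GB disk figures quoted in the guide.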

MoE Architecture & Active Parameters

  • DeepSeek V3/R1 is a sparse MoE: of the 256 experts per MoE layer, the top K=8 routed experts plus 1 always-on shared expert are active for each token.
  • “37B” refers to active parameters per token (experts + router + overhead), not size of a single expert.
  • If the same experts are selected for consecutive tokens, later tokens can reuse experts in VRAM and behave more like a 37B model in practice; if experts change frequently, CPU–GPU transfers dominate.
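The active-vs-total distinction above can be sketched numerically. The per-layer sizes below are illustrative placeholders, not the real DeepSeek-V3 architecture constants; only the expert counts (256 routed, top-8 + 1 shared) come from the discussion:

```python
# Sketch: why only ~37B of ~671B parameters are touched per token.
# Expert counts match the DeepSeek V3/R1 layout described above;
# the per-layer parameter sizes are HYPOTHETICAL round numbers.
n_experts = 256
active_experts = 8 + 1          # top-8 routed + 1 shared

expert_params = 44e6            # illustrative params per expert FFN
dense_per_layer = 0.2e9         # illustrative attention/router/norm params
n_layers = 61

total = n_layers * (dense_per_layer + n_experts * expert_params)
active = n_layers * (dense_per_layer + active_experts * expert_params)
print(f"total ≈ {total/1e9:.0f}B, active per token ≈ {active/1e9:.0f}B")
```

With these placeholder sizes the totals come out near 700B/36B: the same order as the real 671B/37B, showing that the 18x gap is simply 9 active experts out of 257 per layer, plus the dense layers every token must traverse.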

Implementation Details (llama.cpp / ipex-llm)

  • This builds on llama.cpp with Intel’s ipex-llm extensions for hybrid CPU–GPU MoE.
  • One commenter notes llama.cpp traditionally splits layers between CPU and GPU, so GPU speedups are gated by CPU layers; Intel says they add extra MoE-specific optimizations.
  • With a single A770, context length appears limited (~1024 tokens); more GPUs may allow longer context.
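The commenter's gating argument is essentially a serial-pipeline model: each decoded token walks every layer in order, so fast GPU layers cannot hide slow CPU layers. A toy model with hypothetical per-layer latencies:

```python
# Toy model of llama.cpp-style hybrid decode: per-token latency is the
# SUM of time in CPU-resident and GPU-resident layers, so CPU layers
# gate throughput. All latency numbers below are hypothetical.
def tokens_per_sec(n_layers, gpu_layers, cpu_ms_per_layer, gpu_ms_per_layer):
    cpu_layers = n_layers - gpu_layers
    latency_ms = cpu_layers * cpu_ms_per_layer + gpu_layers * gpu_ms_per_layer
    return 1000.0 / latency_ms

for offload in (0, 16, 32, 61):
    rate = tokens_per_sec(61, offload, cpu_ms_per_layer=5.0,
                          gpu_ms_per_layer=0.5)
    print(f"{offload:2d} layers on GPU -> {rate:.1f} tok/s")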

Performance & Benchmarks

  • Official throughput numbers for DeepSeek-R1 in this setup are sparse; the only figure given is a claim of “>8 tokens/s” on a dual-socket 5th‑gen Xeon.
  • People criticize other large-CPU rigs (e.g., dual Epyc) that get 3–4 tok/s on reasoning models as effectively unusable for long think phases.
  • Others argue even slow local setups are valuable for development or for those prioritizing locality over speed.
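The numbers above are roughly what a memory-bandwidth ceiling predicts: during decode, the active weights must stream from RAM at least once per token. The bandwidth and bits-per-weight figures below are assumed estimates, not measurements:

```python
# Bandwidth ceiling for CPU-side decode: tok/s <= bandwidth / bytes-per-token.
# ASSUMPTIONS: ~37B active params, ~4.8 bits/weight (Q4_K_M), and
# rough aggregate bandwidth figures for each platform.
active_params = 37e9
bits_per_weight = 4.8
bytes_per_token = active_params * bits_per_weight / 8   # ≈ 22 GB

for name, bw_gbs in [("8-channel DDR4-3200", 200),
                     ("dual-socket DDR5 Xeon", 700)]:
    ceiling = bw_gbs * 1e9 / bytes_per_token
    print(f"{name}: <= {ceiling:.1f} tok/s")
```

This puts a DDR4 Epyc rig under ~9 tok/s even in theory (with 3–4 tok/s plausible in practice), while high-bandwidth DDR5 Xeons have enough headroom that “>8 tokens/s” is credible.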

Hardware, Cost, and Alternatives

  • Xeon is favored for many memory channels and PCIe lanes; consumer CPUs typically top out at 128–256 GB RAM.
  • Some argue 384 GB DDR4 is now relatively cheap; others note Intel likely used high-end DDR5 Xeons to hit reported speeds.
  • Debate over multi-GPU vs buying a single large-VRAM accelerator; multi-GPU is often tricky and not always faster.
  • Quantization tradeoffs: Q4 seen as weak for coding, Q5 as a sweet spot; very low-bit (~Q2) DeepSeek-R1/V2.5 quants can still be surprisingly capable, especially outside creative writing.
  • Ecosystem sentiment: for serious work, Nvidia is still recommended; Intel Arc and AMD are improving but lag in software support. Some speculate APUs with large unified memory may shift this landscape.
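The quant levels debated above differ mainly in footprint. The bits-per-weight averages below are rough community figures for GGUF k-quants (actual file sizes vary a little by model):

```python
# Approximate on-disk sizes of a 671B-parameter model at common GGUF
# quant levels. Bits-per-weight values are ROUGH averages, not exact.
PARAMS = 671e9
BPW = {"Q2_K": 2.6, "Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q8_0": 8.5}

for name, bpw in BPW.items():
    print(f"{name}: ~{PARAMS * bpw / 8 / 1e9:.0f} GB")
```

The gap explains the debate: stepping from Q4 to the Q5 “sweet spot” adds on the order of 70–80 GB of RAM for this model, while a ~Q2 quant roughly halves the footprint, which is why its “surprisingly capable” showing matters for constrained rigs.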