DeepSeek-R1-671B-Q4_K_M with 1 or 2 Arc A770 on Xeon
Headline vs. Reality
- Multiple commenters say the title is misleading: you are not truly “running DeepSeek-R1-671B on 1–2 A770s” in VRAM.
- The guide’s early example is a 7B distilled model, but deeper in it discusses the full DeepSeek-R1 Q4_K_M with FlashMoE.
- The actual recipe: ~380 GB of system RAM + 1–8 Intel Arc A770 GPUs + 500 GB disk. Most of the model resides in CPU memory; GPUs offload a smaller portion.
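As a back-of-the-envelope check on the ~380 GB RAM figure: a Q4_K_M quant averages roughly 4.8–4.9 bits per weight (an assumed average; the real GGUF mixes several tensor types):

```python
# Rough memory estimate for a 671B-parameter model at Q4_K_M.
# 4.85 bits/weight is an assumed average, not an exact GGUF figure.
params = 671e9
bits_per_weight = 4.85
model_bytes = params * bits_per_weight / 8
print(f"~{model_bytes / 2**30:.0f} GiB")  # prints "~379 GiB"
```

This lines up with the quoted requirement: the weights alone nearly fill 384 GB, before KV cache and runtime overhead.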
MoE Architecture & Active Parameters
- DeepSeek V3/R1 is a sparse MoE: of the 256 routed experts per layer, K=8 are selected per token, plus 1 shared expert that is always active.
- “37B” refers to the active parameters per token (selected experts plus shared layers and routing overhead), not the size of a single expert.
- If the same experts are selected for consecutive tokens, later tokens can reuse experts in VRAM and behave more like a 37B model in practice; if experts change frequently, CPU–GPU transfers dominate.
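The routing described above can be sketched as a toy top-K selection (shapes and the plain softmax gate are simplifications; real DeepSeek routing adds bias terms and gate normalization):

```python
import math
import random

NUM_EXPERTS, TOP_K = 256, 8  # per-layer routed experts, active per token

def route(router_logits):
    """Pick the TOP_K routed experts for one token from router logits.

    The always-active shared expert is omitted here for simplicity.
    """
    # Numerically stable softmax over router logits -> per-expert gate scores.
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]
    total = sum(exps)
    scores = [e / total for e in exps]
    # Indices of the K highest-scoring experts, highest first.
    topk = sorted(range(NUM_EXPERTS), key=lambda i: scores[i], reverse=True)[:TOP_K]
    return topk, [scores[i] for i in topk]

random.seed(0)
logits = [random.gauss(0, 1) for _ in range(NUM_EXPERTS)]
experts, gates = route(logits)
print(len(experts))  # prints 8: only 8 of 256 routed experts run for this token
```

Whether consecutive tokens reuse the same `experts` list is exactly what decides if the cached-in-VRAM experts help or if fresh CPU-to-GPU transfers dominate.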
Implementation Details (llama.cpp / ipex-llm)
- This builds on llama.cpp with Intel’s ipex-llm extensions for hybrid CPU–GPU MoE.
- One commenter notes llama.cpp traditionally splits layers between CPU and GPU, so GPU speedups are gated by CPU layers; Intel says they add extra MoE-specific optimizations.
- With a single A770, context length appears limited (~1024 tokens); more GPUs may allow longer context.
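One way to see why GPU speedups are gated by CPU-resident layers is a serial per-token latency model, in the spirit of Amdahl's law (all timings below are made-up illustrations, not measurements):

```python
# Toy per-token latency model for a layer-split hybrid CPU/GPU setup.
# t_cpu_layer and t_gpu_layer are illustrative numbers, not benchmarks.
def tokens_per_sec(n_layers, gpu_layers, t_cpu_layer=0.020, t_gpu_layer=0.002):
    """Decode throughput when layers run serially, split across CPU and GPU."""
    cpu_layers = n_layers - gpu_layers
    t_token = cpu_layers * t_cpu_layer + gpu_layers * t_gpu_layer
    return 1.0 / t_token

# DeepSeek V3/R1 has 61 transformer layers.
for g in (0, 10, 30, 61):
    print(g, round(tokens_per_sec(61, g), 1))
```

Under these assumed timings, offloading a minority of layers barely moves throughput; it only approaches GPU-class speed when nearly everything fits on the GPU, which is why MoE-specific tricks (offloading only the hot experts) matter.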
Performance & Benchmarks
- Official throughput numbers for DeepSeek-R1 in this setup are sparse; the only figure cited is a claim of “>8 tokens/s” on a dual-socket 5th-gen Xeon.
- People criticize other large-CPU rigs (e.g., dual Epyc) that get 3–4 tok/s on reasoning models as effectively unusable for long think phases.
- Others argue even slow local setups are valuable for development or for those prioritizing locality over speed.
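To make the usability complaint concrete, here is the wall-clock time for a long reasoning ("think") phase at the quoted rates; the 5,000-token think-phase length is an illustrative assumption:

```python
def think_time_minutes(think_tokens, tok_per_sec):
    """Minutes spent waiting for a reasoning phase of the given length."""
    return think_tokens / tok_per_sec / 60

# 3-4 tok/s (criticized dual-Epyc rigs) vs. the claimed >8 tok/s on Xeon.
for rate in (3, 8):
    print(rate, round(think_time_minutes(5000, rate), 1))
# prints: 3 27.8  /  8 10.4
```

At 3 tok/s a single long think phase is close to half an hour, which is why commenters call such rigs effectively unusable for interactive reasoning work.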
Hardware, Cost, and Alternatives
- Xeon is favored for its many memory channels and PCIe lanes; consumer platforms typically top out at 128–256 GB of RAM.
- Some argue 384 GB DDR4 is now relatively cheap; others note Intel likely used high-end DDR5 Xeons to hit reported speeds.
- Debate over multi-GPU vs buying a single large-VRAM accelerator; multi-GPU is often tricky and not always faster.
- Quantization tradeoffs: Q4 is seen as weak for coding and Q5 as a sweet spot, yet very low-bit (~Q2) DeepSeek-R1/V2.5 quants can still be surprisingly capable, especially outside creative writing.
- Ecosystem sentiment: for serious work, Nvidia is still recommended; Intel Arc and AMD are improving but lag in software support. Some speculate APUs with large unified memory may shift this landscape.
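For the quantization tradeoff above, approximate file sizes at different bit widths show why very low-bit quants are attractive on RAM-limited rigs (the bits-per-weight averages are assumptions; real GGUF quants mix tensor types):

```python
PARAMS = 671e9  # DeepSeek-R1 total parameters

# Assumed average bits/weight per quant level; actual GGUF mixes vary.
QUANTS = {"Q2_K": 2.6, "Q4_K_M": 4.85, "Q5_K_M": 5.7}

for name, bpw in QUANTS.items():
    gib = PARAMS * bpw / 8 / 2**30
    print(f"{name}: ~{gib:.0f} GiB")
```

Under these assumptions a ~Q2 quant comes in around 200 GiB, small enough for a 256 GB box, while the Q5 "sweet spot" pushes well past 384 GB of DDR4 once KV cache is added.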