How to Run DeepSeek R1 671B Locally on a $2000 EPYC Server
Hardware setup & observed performance
- Article’s build: single-socket EPYC with 512 GB DDR4, running DeepSeek-R1 671B Q4 via Ollama, reported at ~3.5–4.25 tokens/sec (TPS).
- A dual-socket EPYC / 768 GB RAM setup (about $6k) reportedly runs the original Q8 model at ~6–8 TPS.
- Users running extreme Unsloth quantizations (1.58–2 bit) from NVMe on much smaller consumer systems report ~0.11–0.16 TPS, confirming that fitting the whole model in RAM, rather than streaming weights from disk, is the dominant bottleneck.
- Another home server with 2×3090 + 192 GB DDR5 gets ~4–5 TPS on the small dynamic-quant variant (4K context).
- Power measurements from the article’s rig: ~60 W idle, ~260 W under load, lower than some commenters’ 1 kW assumptions.
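The throughput figures above line up with a simple memory-bandwidth model: generating one token requires reading every active parameter once, so tokens/sec is bounded by effective bandwidth divided by bytes read per token. A rough sketch (the bandwidth efficiency factor and platform config are illustrative assumptions, not measurements from the article):

```python
# Rough memory-bandwidth upper bound for CPU inference of an MoE model.
# TPS <= effective_bandwidth / bytes_read_per_token.

def est_tps(bandwidth_gbs: float, active_params_b: float, bits_per_weight: float) -> float:
    """Upper-bound tokens/sec if every active weight is read once per token."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gbs * 1e9 / bytes_per_token

# Hypothetical single-socket EPYC: 8 channels of DDR4-3200, assumed ~70%
# of theoretical peak is achievable in practice.
peak_gbs = 8 * 3200 * 8 / 1000           # 204.8 GB/s theoretical
print(est_tps(peak_gbs * 0.7, 37, 4))    # ~37B active params (MoE) at Q4: ~7.7 TPS bound
```

The observed ~3.5–4.25 TPS sits comfortably under this ~7.7 TPS ceiling, which is consistent with the build being bandwidth-limited rather than compute-limited.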
Storage, memory bandwidth & RAID
- Some suggest RAID0 across multiple NVMe drives to speed initial model load; with proper striping and IO alignment, >12 GB/s reads have been seen on similar platforms.
- Others caution that naïve mdraid/ZFS setups can incur noticeable CPU and RAM overhead unless carefully tuned.
- Discussion emphasizes memory channels and total bandwidth over raw DDR transfer rate; more populated channels often beat faster (higher MT/s) DIMMs.
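The channels-over-speed point follows directly from the arithmetic: theoretical bandwidth is channels × transfer rate × 8 bytes per 64-bit channel. A quick comparison (the two example platforms are assumptions for illustration):

```python
# Theoretical peak memory bandwidth: channels * MT/s * 8 bytes per transfer.

def bandwidth_gbs(channels: int, mts: int) -> float:
    """Peak bandwidth in GB/s for 64-bit (8-byte) channels."""
    return channels * mts * 8 / 1000  # MT/s * 8 B = MB/s; /1000 -> GB/s

print(bandwidth_gbs(8, 3200))  # 8-channel DDR4-3200 (server EPYC): 204.8 GB/s
print(bandwidth_gbs(2, 6000))  # 2-channel DDR5-6000 (consumer desktop): 96.0 GB/s
```

Despite the consumer DIMMs being nearly twice as fast per channel, the 8-channel server platform has over 2× the aggregate bandwidth, which is what matters for bandwidth-bound inference.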
Cost, efficiency & cloud vs local
- Several calculations argue API access (~$2/MTok for R1-class models) is far cheaper than buying and powering a $2k+ box, unless utilization and/or electricity are very favorable.
- Counterargument: privacy, policy constraints (no data off-prem), and general homelab utility (hosting many services) can justify the CapEx.
- Some think local R1 at 3–4 TPS is already “usable” for non-interactive tasks; others consider anything <10 TPS impractical once the “thinking” phase is included.
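The cloud-vs-local economics can be sketched as a breakeven calculation: how many tokens must be generated before the hardware cost is recovered from the API-price savings, net of electricity. All prices below are illustrative assumptions:

```python
# Back-of-the-envelope breakeven between API access and a local rig.
# All prices and rates are assumptions for illustration.

def breakeven_mtok(capex_usd: float, power_w: float, kwh_price: float,
                   tps: float, api_price_per_mtok: float) -> float:
    """Millions of tokens at which local generation matches total API spend."""
    hours_per_mtok = 1e6 / tps / 3600               # wall-clock hours per MTok
    energy_per_mtok = power_w / 1000 * hours_per_mtok * kwh_price
    if api_price_per_mtok <= energy_per_mtok:
        return float("inf")  # local never catches up on marginal cost alone
    return capex_usd / (api_price_per_mtok - energy_per_mtok)

# $2000 box, 260 W under load, $0.15/kWh, 4 TPS, $2/MTok API price:
print(breakeven_mtok(2000, 260, 0.15, 4, 2.0))  # -> inf: electricity alone exceeds $2/MTok
```

At these assumed numbers the marginal electricity cost (~$2.71/MTok at 4 TPS) already exceeds the API price, which illustrates the commenters' point: the local box only pencils out with cheaper power, higher throughput, or non-monetary motives like privacy.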
Privacy, security & local vs cloud
- Strong recurring theme: privacy is the primary reason to self-host, especially for proprietary code or customer data.
- Skepticism that cloud contracts meaningfully protect against state-level surveillance; on-prem, even if poorly secured, at least requires more effort to access.
- Local DeepSeek model itself is seen as safe; concerns apply mainly to using DeepSeek’s hosted service.
Quantization, MoE and model choices
- Clarifications that the showcased build runs Q4 quantization, not full 8-bit; the default Ollama 671B model (~400 GB) is smaller than the ~700 GB original weights on Hugging Face.
- MoE architecture (only ~37B active parameters per token) is what makes CPU inference at all feasible; a dense 671B would be far worse.
- Many argue that, in practice, smaller 7–70B models (often on a single RTX 3090/4090 or M-series Mac) give far better speed/quality tradeoffs for most users.
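The size figures in this section follow from the basic relation size ≈ parameters × bits per weight / 8, with some overhead for quantization scales and metadata. A sketch (the 4.5 effective bits/weight for Q4 is an assumed overhead factor, not a figure from the discussion):

```python
# Model file size scales with bits per weight: size = params * bits / 8.
# The ~671B parameter count is from the article; overhead factors are assumptions.

def model_size_gb(params_b: float, effective_bits: float) -> float:
    """Approximate on-disk size in GB for a quantized model."""
    return params_b * 1e9 * effective_bits / 8 / 1e9

print(model_size_gb(671, 8))     # 8-bit: ~671 GB (near the ~700 GB HF repo)
print(model_size_gb(671, 4.5))   # ~Q4 incl. overhead: ~377 GB (near Ollama's ~400 GB)
print(model_size_gb(671, 1.58))  # extreme Unsloth quant: ~133 GB
```

This also shows why the extreme 1.58-bit quantizations are attractive on consumer hardware: they are the only variants that come close to fitting in typical desktop RAM, albeit at the severe quality and speed costs noted above.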
Skepticism about the article & components
- Some accuse the post of being “affiliate-linky,” with underpriced or mismatched parts (e.g., 8×32 GB kit linked while claiming 512 GB, EPYC prices off by ~2×).
- Others defend it as genuine, useful content that happens to use affiliate links, but agree that RAM and CPU pricing/specs should be scrutinized.