Complete hardware and software setup for running DeepSeek-R1 locally

Performance and Practicality

  • The showcased build achieves ~6–8 tokens/second on Q8 R1, which many see as “usable but slow,” especially for reasoning models that generate long “thinking” traces before the final answer.
  • Several commenters say chat feels acceptable around 15 t/s, while code-assistant use starts feeling good closer to 30 t/s; at 6–8 t/s many expect noticeable friction and broken flow.
  • Some are happy to run big, slow models in the background and wait, or use them for batch-like tasks; others view this as more of a tech demo than a practical daily driver.
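The throughput figures above translate directly into wall-clock waits, which is what drives the "usable but slow" verdict. A minimal sketch (trace and answer lengths are illustrative assumptions, not measurements):

```python
def wait_seconds(trace_tokens: int, answer_tokens: int, tokens_per_s: float) -> float:
    """Total generation time: the full thinking trace streams out before the answer."""
    return (trace_tokens + answer_tokens) / tokens_per_s

# Illustrative: a 2,000-token thinking trace plus a 500-token answer.
for tps in (7, 15, 30):
    print(f"{tps:>2} t/s -> {wait_seconds(2000, 500, tps) / 60:.1f} min")
# -> ~6.0 min at 7 t/s, ~2.8 min at 15 t/s, ~1.4 min at 30 t/s
```

The roughly 4x gap between 7 t/s and 30 t/s is the difference between waiting out a coffee break and staying in flow, which matches the 15/30 t/s comfort thresholds commenters cite.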

Hardware Design and Bottlenecks

  • The rig’s core idea: huge DRAM capacity and bandwidth on dual EPYC sockets, no GPU, to fit the full 671B Q8 model.
  • Multiple people argue the true bottleneck is memory bandwidth, not raw FLOPs; reasoning models especially are considered “CPU-unfriendly.”
  • There’s debate over dual-socket benefits: the original thread suggests disabling NUMA groups to “double throughput,” but others note that remote-socket memory access is slower and that llama.cpp’s NUMA support is currently suboptimal; until the software improves, a single high-bandwidth socket might even be faster.
  • Alternative builds are proposed (single-socket EPYC with 12x64GB, Threadripper, cheap dual-socket used servers), but many of these either can’t match the bandwidth or are untested hypotheses.
  • Mac hardware is discussed: Apple’s tightly integrated, non-upgradeable RAM is praised for bandwidth but criticized for caps like 192GB, which block full R1.
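The "bandwidth, not FLOPs" argument can be sanity-checked with a back-of-envelope ceiling: each generated token must stream the active weights from DRAM at least once, so decode speed is bounded by bandwidth divided by bytes read per token. All figures below are assumptions (DeepSeek-R1 is a mixture-of-experts model with roughly 37B active parameters per token, so ~37 GB/token at Q8; a 12-channel DDR5-4800 EPYC socket has on the order of 460 GB/s theoretical bandwidth):

```python
def tokens_per_second_ceiling(active_param_bytes: float, bandwidth_gbps: float) -> float:
    """Upper bound on decode speed for a bandwidth-bound rig:
    DRAM bandwidth / bytes streamed per token (ignores KV cache and overlap)."""
    return bandwidth_gbps * 1e9 / active_param_bytes

# ~37 GB of active Q8 weights per token, ~460 GB/s theoretical per socket.
print(f"single-socket ceiling: ~{tokens_per_second_ceiling(37e9, 460):.1f} t/s")
# -> ~12.4 t/s
```

An observed 6–8 t/s against a ~12 t/s single-socket theoretical ceiling is consistent with the claim that NUMA overhead and immature software, not compute, are what is being left on the table.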

Quantization and Model Choices

  • The $6k build targets Q8 “full quality.” Others point to dynamic low-bit (≈2.5-bit) quantizations that reportedly perform well at ~212GB, suggesting cheaper rigs could run strong variants with less RAM.
  • Some users are satisfied with the smaller distilled R1 variants (1.5B/8B) or DeepSeek-V3 models on M1/M2 Macs or modest PCs, trading quality for speed and cost.
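The RAM figures in the quantization debate follow from simple arithmetic: weight footprint ≈ parameter count × bits per weight / 8 (parameter count and bit-widths below are approximate):

```python
def model_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (ignores KV cache and activations)."""
    return params * bits_per_weight / 8 / 1e9

print(f"671B @ Q8      : ~{model_gb(671e9, 8):.0f} GB")    # ~671 GB
print(f"671B @ ~2.5-bit: ~{model_gb(671e9, 2.51):.0f} GB") # ~211 GB, near the reported ~212GB
```

This is why Q8 forces the ~768GB-class dual-socket build, while the dynamic ~2.5-bit quantizations fit in a 256GB machine with room for KV cache.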

Local vs Cloud and Business Angle

  • One thread explores building a low-cost CPU cluster to commercially host large open models, claiming it could rival specialized inference clouds on cost and speed; others are skeptical of the hardware and bandwidth cost estimates.
  • Broader debate: will cheap local frontier-level models threaten GPU-heavy cloud economics (and Nvidia), or will demand and large-cloud moats (ops, legal, compliance, export control, copyright risk) keep hyperscalers dominant?

Access and Tooling

  • Multiple comments share non-logged-in mirrors (xcancel, Nitter, threadreader, Bluesky) due to dislike of X/Twitter’s UX.
  • Practical tips are traded on downloading the 700GB+ weights from Hugging Face (git LFS vs direct HTTPS), and on llama.cpp configuration and future NUMA optimizations.
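For the weights download, one alternative to git LFS (which duplicates data under .git) is the Hugging Face `huggingface_hub` client, whose `snapshot_download` resumes interrupted transfers and parallelises shard fetches. A hedged sketch; the repo id and file patterns are assumptions, so check the actual model card:

```python
def shard_patterns() -> list:
    """Restrict the fetch to weight shards plus small config/tokenizer files."""
    return ["*.safetensors", "*.json", "tokenizer*"]

def fetch_weights(local_dir: str) -> str:
    # Deferred import so the pattern helper works without the package installed.
    from huggingface_hub import snapshot_download

    # Resumable, parallel download -- useful for a 700GB+ pull that is
    # unlikely to finish in one uninterrupted session.
    return snapshot_download(
        repo_id="deepseek-ai/DeepSeek-R1",  # assumed repo id
        local_dir=local_dir,
        allow_patterns=shard_patterns(),
        max_workers=8,
    )

# Calling fetch_weights("./deepseek-r1") would start the transfer.
```

If the download dies mid-shard, re-running the same call picks up where it left off, which is the main practical advantage over a single long-lived HTTPS stream.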