Complete hardware and software setup for running DeepSeek-R1 locally

Performance and Practicality

  • The showcased build achieves ~6–8 tokens/second on Q8 R1, which many see as “usable but slow,” especially for reasoning models that generate long “thinking” traces before the final answer.
  • Several commenters say chat feels acceptable around 15 t/s, while code-assistant use starts feeling good closer to 30 t/s; at 6–8 t/s many expect noticeable friction and broken flow.
  • Some are happy to run big, slow models in the background and wait, or use them for batch-like tasks; others view this as more of a tech demo than a practical daily driver.
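The throughput figures above translate directly into wall-clock waits, which is what drives the "usable but slow" verdict. A minimal sketch (trace and answer lengths are illustrative assumptions, not measurements):

```python
def wait_seconds(trace_tokens: int, answer_tokens: int, tokens_per_s: float) -> float:
    """Total generation time: the full thinking trace streams out before the answer."""
    return (trace_tokens + answer_tokens) / tokens_per_s

# Illustrative: a 2,000-token thinking trace plus a 500-token answer.
for tps in (7, 15, 30):
    print(f"{tps:>2} t/s -> {wait_seconds(2000, 500, tps) / 60:.1f} min")
# -> ~6.0 min at 7 t/s, ~2.8 min at 15 t/s, ~1.4 min at 30 t/s
```

The roughly 4x gap between 7 t/s and 30 t/s is the difference between waiting out a coffee break and staying in flow, which matches the 15/30 t/s comfort thresholds commenters cite.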

Hardware Design and Bottlenecks

  • The rig’s core idea: huge DRAM capacity and bandwidth on dual EPYC sockets, no GPU, to fit the full 671B Q8 model.
  • Multiple people argue the true bottleneck is memory bandwidth, not raw FLOPs; reasoning models especially are considered “CPU-unfriendly.”
  • There’s debate over dual-socket benefits: the original thread suggests disabling NUMA groups to “double throughput,” but others note that remote-socket memory access is slower and that llama.cpp’s NUMA support is currently suboptimal; until the software improves, a single high-bandwidth socket might even be faster.
  • Alternative builds are proposed (single-socket EPYC with 12x64GB, Threadripper, cheap dual-socket used servers), but many of these either can’t match the bandwidth or are untested hypotheses.
  • Mac hardware is discussed: Apple’s tightly integrated, non-upgradeable RAM is praised for bandwidth but criticized for caps like 192GB, which block full R1.
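The "bandwidth, not FLOPs" argument can be sanity-checked with a back-of-envelope ceiling: each generated token must stream the active weights from DRAM at least once, so decode speed is bounded by bandwidth divided by bytes read per token. All figures below are assumptions (DeepSeek-R1 is a mixture-of-experts model with roughly 37B active parameters per token, so ~37 GB/token at Q8; a 12-channel DDR5-4800 EPYC socket has on the order of 460 GB/s theoretical bandwidth):

```python
def tokens_per_second_ceiling(active_param_bytes: float, bandwidth_gbps: float) -> float:
    """Upper bound on decode speed for a bandwidth-bound rig:
    DRAM bandwidth / bytes streamed per token (ignores KV cache and overlap)."""
    return bandwidth_gbps * 1e9 / active_param_bytes

# ~37 GB of active Q8 weights per token, ~460 GB/s theoretical per socket.
print(f"single-socket ceiling: ~{tokens_per_second_ceiling(37e9, 460):.1f} t/s")
# -> ~12.4 t/s
```

An observed 6–8 t/s against a ~12 t/s single-socket theoretical ceiling is consistent with the claim that NUMA overhead and immature software, not compute, are what is being left on the table.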

Quantization and Model Choices

  • The $6k build targets Q8 “full quality.” Others point to dynamic low-bit (≈2.5-bit) quantizations that reportedly perform well at ~212GB, suggesting cheaper rigs could run strong variants with less RAM.
  • Some users are satisfied with the smaller distilled R1 variants (1.5B/8B) or DeepSeek-V3 models on M1/M2 Macs or modest PCs, trading quality for speed and cost.
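The RAM figures in the quantization debate follow from simple arithmetic: weight footprint ≈ parameter count × bits per weight / 8 (parameter count and bit-widths below are approximate):

```python
def model_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (ignores KV cache and activations)."""
    return params * bits_per_weight / 8 / 1e9

print(f"671B @ Q8      : ~{model_gb(671e9, 8):.0f} GB")    # ~671 GB
print(f"671B @ ~2.5-bit: ~{model_gb(671e9, 2.51):.0f} GB") # ~211 GB, near the reported ~212GB
```

This is why Q8 forces the ~768GB-class dual-socket build, while the dynamic ~2.5-bit quantizations fit in a 256GB machine with room for KV cache.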

Local vs Cloud and Business Angle

  • One thread explores building a low-cost CPU cluster to commercially host large open models, claiming it could rival specialized inference clouds on cost and speed; others are skeptical of the hardware and bandwidth cost estimates.
  • Broader debate: will cheap local frontier-level models threaten GPU-heavy cloud economics (and Nvidia), or will demand and large-cloud moats (ops, legal, compliance, export control, copyright risk) keep hyperscalers dominant?

Access and Tooling

  • Multiple comments share non-logged-in mirrors (xcancel, Nitter, threadreader, Bluesky) due to dislike of X/Twitter’s UX.
  • Practical tips are traded on downloading the 700GB+ weights from Hugging Face (git LFS vs direct HTTPS), and on llama.cpp configuration and future NUMA optimizations.
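For the weights download, one alternative to git LFS (which duplicates data under .git) is the Hugging Face `huggingface_hub` client, whose `snapshot_download` resumes interrupted transfers and parallelises shard fetches. A hedged sketch; the repo id and file patterns are assumptions, so check the actual model card:

```python
def shard_patterns() -> list:
    """Restrict the fetch to weight shards plus small config/tokenizer files."""
    return ["*.safetensors", "*.json", "tokenizer*"]

def fetch_weights(local_dir: str) -> str:
    # Deferred import so the pattern helper works without the package installed.
    from huggingface_hub import snapshot_download

    # Resumable, parallel download -- useful for a 700GB+ pull that is
    # unlikely to finish in one uninterrupted session.
    return snapshot_download(
        repo_id="deepseek-ai/DeepSeek-R1",  # assumed repo id
        local_dir=local_dir,
        allow_patterns=shard_patterns(),
        max_workers=8,
    )

# Calling fetch_weights("./deepseek-r1") would start the transfer.
```

If the download dies mid-shard, re-running the same call picks up where it left off, which is the main practical advantage over a single long-lived HTTPS stream.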