DeepSeek R1 Distill 8B Q40 on 4 x Raspberry Pi 5

Raspberry Pi clusters vs. alternative hardware

  • Many see the 4×RPi5 setup as a “modern Beowulf cluster” demo, but argue that for similar money a used 1U EPYC server or a few Ryzen desktops/mini‑PCs deliver far more performance, more PCIe lanes, and normal firmware.
  • Others point out Pis are quiet, low‑TDP, small, and easier to power/cool in a home than 180W+ servers or GPU rigs; noise from 1U servers is a real blocker for apartments.
  • Several note that the same distributed-llama approach works on x86 boxes running Debian, so Pis are more about accessibility and fun than optimal perf/$.

Power, hosting, and cost debates

  • Disagreement over how hard it is to “just run a 1U server at home”: some say residential power and bills are limiting, others say a few hundred watts is normal for gaming PCs.
  • Confusion about kW vs kWh and colocation pricing is called out; many emphasize that idle draw of small servers/mini‑PCs can be only a few watts.
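The kW vs kWh confusion called out above comes down to power × time = energy. A minimal worked example; the wattages and the electricity price are illustrative assumptions, not figures from the thread:

```python
# Energy (kWh) = power (kW) x time (h); cost = energy x price.
# 0.15 USD/kWh and the wattages below are illustrative assumptions.

def monthly_cost_usd(watts: float, price_per_kwh: float = 0.15,
                     hours: float = 24 * 30) -> float:
    kwh = (watts / 1000) * hours
    return kwh * price_per_kwh

# A 1U server drawing a steady 200 W vs. a mini-PC idling at 5 W:
print(f"200 W server: {monthly_cost_usd(200):.2f} USD/month")
print(f"5 W mini-PC:  {monthly_cost_usd(5):.2f} USD/month")
```

At these assumed rates, the "few hundred watts" server costs on the order of 20 USD/month, while the few-watt idle draw of a mini‑PC is well under a dollar.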

Distilled R1 vs “real” R1 and naming

  • Thread stresses these are DeepSeek‑R1 distill models (e.g., Distill‑Llama‑8B), not the full 671B‑parameter reasoning model.
  • Pushback that calling them “DeepSeek R1” is misleading; they’re Llama/Qwen finetunes with R1‑style chain‑of‑thought, not from‑scratch distilled replicas.
  • Distillation is described as training a smaller “student” model on prompts and outputs (and ideally token probabilities) from a larger “teacher”; quantization compresses weights to fewer bits per parameter.
  • Some note reports that DeepSeek itself may be (partly) distilled from OpenAI models, but that remains contested.
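The distillation objective described above (matching the teacher's token probabilities, not just its sampled outputs) can be sketched as a KL loss over the vocabulary. A minimal NumPy sketch; the function names, shapes, and temperature value are illustrative assumptions:

```python
import numpy as np

def softmax(logits: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over the vocabulary, averaged over positions.
    A higher temperature softens both distributions so the student also
    learns the teacher's ranking of less likely tokens."""
    p = softmax(teacher_logits, temperature)              # teacher probs
    log_q = np.log(softmax(student_logits, temperature))  # student log-probs
    return float(np.mean(np.sum(p * (np.log(p) - log_q), axis=-1)))

# Identical logits give zero loss; diverging logits give a positive loss.
t = np.array([[2.0, 0.5, -1.0]])
s = np.array([[0.0, 0.0, 0.0]])
print(distill_loss(t, t), distill_loss(t, s))
```

In practice the student minimizes this loss (often mixed with an ordinary next-token loss) by gradient descent; when only the teacher's text is available, training degrades to imitating sampled outputs alone.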

Performance, tokens/sec, and use cases

  • Skepticism about reported tok/s, since short demos hide the slowdown at long context lengths; even EPYC drops to a few tok/s at 8–16k tokens.
  • Critics ask who wants a reasoning model at very low speed; others say many tasks are non‑interactive: background agents, CI, nightly jobs, home automation watchers.
  • Key interest is the distributed inference itself—sharding a model across CPUs over Ethernet—even though scaling is non‑linear and quickly bottlenecked by interconnect.
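The sharding idea above can be illustrated with a toy tensor-parallel matmul: each node holds a column slice of a layer's weight matrix, computes its partial output locally, and the partials are gathered (in the real system, over Ethernet). This is a sketch of the general technique, not distributed-llama's actual partitioning; node count and shapes are illustrative:

```python
import numpy as np

N_NODES = 4
d_in, d_out = 8, 16
rng = np.random.default_rng(0)

W = rng.standard_normal((d_in, d_out))   # full layer weights
shards = np.split(W, N_NODES, axis=1)    # one column slice per "node"
x = rng.standard_normal(d_in)            # activations, broadcast to all nodes

# Each node computes its slice; the concatenation stands in for the
# network gather step, which is the interconnect bottleneck noted above.
partials = [x @ shard for shard in shards]
y = np.concatenate(partials)

assert np.allclose(y, x @ W)   # sharded result matches the single-node matmul
```

Because every layer needs such a gather, per-token latency picks up a fixed network cost per layer, which is why scaling over Ethernet is non‑linear.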

Bias, censorship, and responsible AI

  • Multiple comments report strongly pro‑China behavior and censorship around topics like Tiananmen in DeepSeek’s ecosystem, with reasoning traces that explicitly mention needing to follow Chinese guidelines.
  • Some see this as intentional propaganda; others frame it as dataset/policy bias. There’s disagreement over how much this should dominate discussion versus pure technical merit.

Memory, quantization, and hardware limits

  • Back‑of‑envelope: 8B @ Q4 ≈ 4 GB plus overhead; 8 GB Pis can fit such a model, 16 GB helps only for larger contexts/models.
  • Discussion of tradeoffs: more parameters at lower precision vs fewer at full precision; memory bandwidth vs compute‑bound regimes, especially for small models.
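The back-of-envelope above is just parameters × bits ÷ 8; a quick check, noting that KV‑cache and runtime overhead come on top of the raw weight size:

```python
# Raw weight memory for a model: params x bits per weight / 8 bytes.
# KV-cache and runtime overhead are extra and grow with context length.

def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

q4 = weight_gb(8, 4)     # ~4 GB: fits an 8 GB Pi with room for context
f16 = weight_gb(8, 16)   # ~16 GB: too large even before overhead
print(f"8B @ 4-bit: {q4:.1f} GB")
print(f"8B @ fp16:  {f16:.1f} GB")
```

The same formula shows why the tradeoff in the second bullet exists: at a fixed memory budget, halving the bits per weight roughly doubles the parameter count you can fit.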

Tooling and packaging

  • People highlight Ollama, llama.cpp, MLX, Home Assistant, etc., and wish for an “apt‑get install” LLM stack and an Alexa‑like local smart speaker product.
  • Packaging work is noted as volunteer‑heavy; complaints without contributions won’t fix missing Debian packages.