DeepSeek R1 Distill 8B Q40 on 4 x Raspberry Pi 5

Raspberry Pi clusters vs. alternative hardware

  • Many see the 4×RPi5 setup as a “modern Beowulf cluster” demo, but argue that for similar money a used 1U EPYC server or a few Ryzen desktops/mini‑PCs deliver far more performance, more PCIe lanes, and normal firmware.
  • Others point out Pis are quiet, low‑TDP, small, and easier to power/cool in a home than 180W+ servers or GPU rigs; noise from 1U servers is a real blocker for apartments.
  • Several note that the same distributed-llama approach works on x86 boxes running Debian, so Pis are more about accessibility and fun than optimal perf/$.

Power, hosting, and cost debates

  • Disagreement over how hard it is to “just run a 1U server at home”: some say residential power and bills are limiting, others say a few hundred watts is normal for gaming PCs.
  • Confusion about kW vs kWh and colocation pricing is called out; many emphasize that idle draw of small servers/mini‑PCs can be only a few watts.
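The kW vs kWh confusion called out above comes down to power × time = energy. A minimal worked example; the wattages and the electricity price are illustrative assumptions, not figures from the thread:

```python
# Energy (kWh) = power (kW) x time (h); cost = energy x price.
# 0.15 USD/kWh and the wattages below are illustrative assumptions.

def monthly_cost_usd(watts: float, price_per_kwh: float = 0.15,
                     hours: float = 24 * 30) -> float:
    kwh = (watts / 1000) * hours
    return kwh * price_per_kwh

# A 1U server drawing a steady 200 W vs. a mini-PC idling at 5 W:
print(f"200 W server: {monthly_cost_usd(200):.2f} USD/month")
print(f"5 W mini-PC:  {monthly_cost_usd(5):.2f} USD/month")
```

At these assumed rates, the "few hundred watts" server costs on the order of 20 USD/month, while the few-watt idle draw of a mini‑PC is well under a dollar.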

Distilled R1 vs “real” R1 and naming

  • Thread stresses these are DeepSeek‑R1 distill models (e.g., Distill‑Llama‑8B), not the full 671B‑parameter reasoning model.
  • Pushback that calling them “DeepSeek R1” is misleading; they’re Llama/Qwen finetunes with R1‑style chain‑of‑thought, not from‑scratch distilled replicas.
  • Distillation is described as training a smaller “student” model on prompts and outputs (and ideally token probabilities) from a larger “teacher”; quantization compresses weights to fewer bits per parameter.
  • Some note reports that DeepSeek itself may be (partly) distilled from OpenAI models, but that remains contested.
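The distillation objective described above (matching the teacher's token probabilities, not just its sampled outputs) can be sketched as a KL loss over the vocabulary. A minimal NumPy sketch; the function names, shapes, and temperature value are illustrative assumptions:

```python
import numpy as np

def softmax(logits: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over the vocabulary, averaged over positions.
    A higher temperature softens both distributions so the student also
    learns the teacher's ranking of less likely tokens."""
    p = softmax(teacher_logits, temperature)              # teacher probs
    log_q = np.log(softmax(student_logits, temperature))  # student log-probs
    return float(np.mean(np.sum(p * (np.log(p) - log_q), axis=-1)))

# Identical logits give zero loss; diverging logits give a positive loss.
t = np.array([[2.0, 0.5, -1.0]])
s = np.array([[0.0, 0.0, 0.0]])
print(distill_loss(t, t), distill_loss(t, s))
```

In practice the student minimizes this loss (often mixed with an ordinary next-token loss) by gradient descent; when only the teacher's text is available, training degrades to imitating sampled outputs alone.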

Performance, tokens/sec, and use cases

  • Skepticism about reported tok/s, since short demos hide the slowdown at long context lengths; even EPYC drops to a few tok/s at 8–16k tokens.
  • Critics ask who wants a reasoning model at very low speed; others say many tasks are non‑interactive: background agents, CI, nightly jobs, home automation watchers.
  • Key interest is the distributed inference itself—sharding a model across CPUs over Ethernet—even though scaling is non‑linear and quickly bottlenecked by interconnect.
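The sharding idea above can be illustrated with a toy tensor-parallel matmul: each node holds a column slice of a layer's weight matrix, computes its partial output locally, and the partials are gathered (in the real system, over Ethernet). This is a sketch of the general technique, not distributed-llama's actual partitioning; node count and shapes are illustrative:

```python
import numpy as np

N_NODES = 4
d_in, d_out = 8, 16
rng = np.random.default_rng(0)

W = rng.standard_normal((d_in, d_out))   # full layer weights
shards = np.split(W, N_NODES, axis=1)    # one column slice per "node"
x = rng.standard_normal(d_in)            # activations, broadcast to all nodes

# Each node computes its slice; the concatenation stands in for the
# network gather step, which is the interconnect bottleneck noted above.
partials = [x @ shard for shard in shards]
y = np.concatenate(partials)

assert np.allclose(y, x @ W)   # sharded result matches the single-node matmul
```

Because every layer needs such a gather, per-token latency picks up a fixed network cost per layer, which is why scaling over Ethernet is non‑linear.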

Bias, censorship, and responsible AI

  • Multiple comments report strongly pro‑China behavior and censorship around topics like Tiananmen in DeepSeek’s ecosystem, with reasoning traces that explicitly mention needing to follow Chinese guidelines.
  • Some see this as intentional propaganda; others frame it as dataset/policy bias. There’s disagreement over how much this should dominate discussion versus pure technical merit.

Memory, quantization, and hardware limits

  • Back‑of‑envelope: 8B @ Q4 ≈ 4 GB plus overhead; 8 GB Pis can fit such a model, 16 GB helps only for larger contexts/models.
  • Discussion of tradeoffs: more parameters at lower precision vs fewer at full precision; memory bandwidth vs compute‑bound regimes, especially for small models.
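The back-of-envelope above is just parameters × bits ÷ 8; a quick check, noting that KV‑cache and runtime overhead come on top of the raw weight size:

```python
# Raw weight memory for a model: params x bits per weight / 8 bytes.
# KV-cache and runtime overhead are extra and grow with context length.

def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

q4 = weight_gb(8, 4)     # ~4 GB: fits an 8 GB Pi with room for context
f16 = weight_gb(8, 16)   # ~16 GB: too large even before overhead
print(f"8B @ 4-bit: {q4:.1f} GB")
print(f"8B @ fp16:  {f16:.1f} GB")
```

The same formula shows why the tradeoff in the second bullet exists: at a fixed memory budget, halving the bits per weight roughly doubles the parameter count you can fit.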

Tooling and packaging

  • People highlight Ollama, llama.cpp, MLX, Home Assistant, etc., and wish for an “apt‑get install” LLM stack and an Alexa‑like local smart speaker product.
  • Packaging work is noted as volunteer‑heavy; complaints without contributions won’t fix missing Debian packages.