Qwen3 30B A3B Hits 13 tok/s on 4x Raspberry Pi 5
Technical approach and scaling
- The setup uses distributed-llama with tensor parallelism across Raspberry Pi 5s; each node holds a shard of the model and synchronizes over Ethernet.
- Scaling is constrained: the maximum node count is roughly the number of KV heads, and the current implementation requires 2^n nodes, with each node owning a whole number of KV heads (see the sketch after this list).
- People question how well performance would scale beyond 4 Pis (e.g., 40 Pis), expecting diminishing returns due to network latency, NUMA-like bottlenecks, and synchronization overhead.
- Some ask about more advanced networking (RoCE, Ultra Ethernet), but there’s no indication it’s currently used.
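The node-count rule can be made concrete with a small sketch. It assumes the constraint is simply "a power-of-two node count that divides the KV-head count evenly"; this is an illustration of the rule as described in the thread, not distributed-llama's actual code.

```python
# Hypothetical sketch of the node-count constraint: node counts must be
# powers of two and each node must own a whole number of KV heads.
# Illustration only -- not distributed-llama's implementation.

def valid_node_counts(n_kv_heads: int) -> list[int]:
    """Return the cluster sizes a model with n_kv_heads KV heads can use."""
    counts, n = [], 1
    while n <= n_kv_heads:
        if n_kv_heads % n == 0:  # each node gets a whole number of KV heads
            counts.append(n)
        n *= 2
    return counts

# Qwen3 30B A3B uses grouped-query attention with 4 KV heads (per its
# published config), which would explain the 4-node ceiling here.
print(valid_node_counts(4))   # [1, 2, 4]
print(valid_node_counts(32))  # [1, 2, 4, 8, 16, 32]
```

Under this rule, going from 4 to 40 Pis isn't just a latency question: a 4-KV-head model simply has no valid configuration wider than 4 nodes.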
Performance vs hardware cost
- Several commenters find 13 tok/s for ~$300–500 in Pi hardware “underwhelming,” suggesting used GPUs, used mini PCs, or old Xeon/Ryzen boxes yield better cost/performance.
- Multiple comparisons favor:
- Used Apple Silicon Macs (M1/M3/M4) with large unified memory as a strong local-inference option.
- New Ryzen AI/Strix Halo mini PCs with up to 128GB of unified RAM as another path, though memory-bandwidth limits are noted (see the back-of-envelope sketch after this list).
- Cheap RK3588 boards (Orange Pi, Rock 5) offering competitive or better tok/s than the Pi 5 for some models.
- Others note that GPUs still dominate raw performance, but are expensive, power-hungry, and VRAM-limited at consumer price points.
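A rough way to see why these comparisons keep returning to memory bandwidth: at decode time a memory-bound model is capped by how fast its active weights stream from RAM. Every figure in the sketch below (active-parameter count, 4-bit quantization, per-node bandwidth) is an assumption for illustration, not a measurement.

```python
# Back-of-envelope decode ceiling, assuming generation is memory-bandwidth
# bound: each token must stream the model's *active* weights from RAM.
# All figures below are rough assumptions, not measurements.

active_params = 3.3e9      # Qwen3 30B A3B activates ~3.3B of its 30B params/token
bytes_per_param = 0.5      # ~4-bit quantization
node_bw = 12e9             # assumed usable LPDDR4X bandwidth per Pi 5, bytes/s
nodes = 4

bytes_per_token = active_params * bytes_per_param  # ~1.65 GB read per token
ceiling = nodes * node_bw / bytes_per_token        # aggregate bandwidth / token cost

print(f"theoretical ceiling = {ceiling:.0f} tok/s")  # ~29 tok/s vs. 13 measured;
# the gap goes to network sync, compute, and non-ideal bandwidth utilization
```

The same arithmetic is what favors Apple Silicon and Strix Halo: a single box with a few hundred GB/s of unified bandwidth beats four Pis' aggregate without any synchronization cost.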
Local models, capability, and hallucinations
- Many see local models like Qwen3-30B A3B as “good enough” for many tasks, comparable to last year’s proprietary SOTA.
- There’s debate on whether “less capable” models are worthwhile for developer assistants:
- Some argue only top-tier models avoid subtle technical debt and poor abstractions.
- Others report real value from smaller coder models (4–15B) as fast local coding aids.
- Hallucinations are seen as the main blocker for “killer apps.” Proposed mitigations include RAG and agentic setups that validate outputs; validation is most tractable in coding, where tests give an objective check (a minimal loop is sketched below), while commenters note it is harder in non-code domains and far from solved.
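To make the agentic-validation idea concrete for the coding case, here is a minimal sketch, assuming a local model callable as `generate` and a test harness as `run_tests`; both are hypothetical stubs, not any real library's API.

```python
# Minimal "validate outputs" loop for coding tasks: generate a candidate,
# check it against tests, retry with feedback on failure.
# `generate` and `run_tests` are hypothetical stubs, not a real API.

def generate(prompt: str) -> str:
    """Call the local model and return candidate code (stub)."""
    raise NotImplementedError

def run_tests(code: str) -> bool:
    """Run the project's test suite against the candidate (stub)."""
    raise NotImplementedError

def generate_with_validation(prompt: str, max_attempts: int = 3) -> str | None:
    feedback = ""
    for _ in range(max_attempts):
        candidate = generate(prompt + feedback)
        if run_tests(candidate):
            return candidate  # output survived an objective check
        feedback = "\n\nThe previous attempt failed its tests; fix it and retry."
    return None  # surface the failure rather than an unvalidated answer
```

The commenters' point about non-code domains falls out of this shape: the loop only works because `run_tests` exists, and open-ended tasks have no obvious equivalent oracle.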
Consumer demand and killer apps
- Opinions diverge on whether consumers care about local AI:
- One camp says hardware is ahead of use cases; people “don’t know what they want yet” and killer apps are missing.
- Another argues people have been heavily exposed to AI and largely don’t want more of the same (meeting notes, coding agents).
Children’s toys and ethics
- Some are excited by Pi-scale LLMs enabling offline, story-remembering, interactive toys—likened to sci‑fi artifacts.
- Others strongly oppose LLMs in kids’ toys, citing parallels with leaving children alone with strangers and concerns over shaping cognition and social norms.
- A middle view emphasizes “thoughtful design” and intentionality in how children interact with AI, rather than blanket enthusiasm or rejection.
Hobbyist and cluster culture
- Several acknowledge Pi clusters as more “proof-of-concept” or tinkering platforms than practical inference hardware.
- Many hobbyists accumulate multiple Pis or SBCs from unfinished projects; repurposing them for distributed inference is seen as fun, if not strictly rational.
- There’s recognition that for serious, cost‑sensitive workloads, used desktops, mini PCs, or a single strong machine often beat small ARM clusters.
Enterprise and labor implications
- One long comment argues that even modest-speed, cheap local LLMs can automate large fractions of structured white‑collar tasks documented in procedures and job aids.
- This view sees near-term disruption in “mind-numbingly tedious” office work, with human‑in‑the‑loop oversight, and raises questions about future work hours and the relative value of “embodied” service jobs that can’t be automated.