A 30B Qwen model walks into a Raspberry Pi and runs in real time

Performance & “real time” on Raspberry Pi

  • OP’s claim clarified: on a Pi 5 (16GB), Qwen3‑30B‑A3B Q3_K_S (2.7 bpw) reaches ~8 tokens/sec while preserving ~94% of BF16 benchmark accuracy.
  • Several users reproduced the results: one needed -c 4096 (a shorter context window) to avoid OOM, then saw ~6–8 tok/s generation and ~8–10 tok/s prompt processing; longer outputs dropped to ~4–6 tok/s (a minimal reproduction sketch follows this list).
  • Another reported only ~3–4 tok/s with the same model, for unclear reasons. Others benchmarked similar speeds on low‑end x86 mini‑PCs and better speeds (15–40 tok/s) on desktop CPUs/GPUs.
  • Some note the reduced context (e.g. 4096) is a real limitation and can degrade answer quality for longer interactions.
  • Debate on “real time”: some use it loosely (“as fast as you can read”), while others insist real‑time means bounded latency (e.g. ~10 tok/s including TTS) or sub‑30 ms reactions.
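
A minimal reproduction sketch using the llama-cpp-python bindings, where n_ctx plays the role of llama.cpp’s -c flag; the GGUF filename and thread count below are illustrative assumptions rather than values from the thread:

```python
# Minimal sketch: run a quantized Qwen3-30B-A3B GGUF with a reduced context
# window to stay inside a Pi 5's 16 GB of RAM. The model filename and thread
# count are assumptions for illustration, not values from the thread.
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen3-30B-A3B-Q3_K_S.gguf",  # hypothetical local GGUF path
    n_ctx=4096,    # equivalent of llama.cpp's `-c 4096`; avoids OOM at the cost of context
    n_threads=4,   # the Pi 5 has four Cortex-A76 cores
)

out = llm(
    "Explain what a mixture-of-experts model is in two sentences.",
    max_tokens=128,
)
print(out["choices"][0]["text"])
```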

Quantization, accuracy metrics & A3B MoE

  • Quantization discussed as per‑tensor and variable‑bit (hence the quoted “average bpw”), with surprisingly high benchmark retention.
  • “Accuracy” here means combined scores on GSM8K, MMLU, IFEval, and LiveCodeBench; a link to the vendor’s methodology is referenced.
  • Some feel calling it “30B” is misleading because it’s an A3B MoE model: only ~3B parameters are active per token, so the memory traffic per token is closer to that of a 3–5B dense model, even though total parameters are ~30B.
  • Explanations of MoE: a small “router” picks a subset of experts for each token, so most of the weights are never fetched on a given step (see the toy routing sketch below).
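
A toy numpy sketch of that routing idea; the layer sizes and expert counts are made up for illustration and do not reflect Qwen3‑30B‑A3B’s real configuration:

```python
# Toy mixture-of-experts routing in numpy. Sizes are illustrative only.
import numpy as np

d_model, d_ff = 64, 256
num_experts, top_k = 8, 2

rng = np.random.default_rng(0)
router_w = rng.standard_normal((d_model, num_experts))   # router projection
experts = [
    (rng.standard_normal((d_model, d_ff)), rng.standard_normal((d_ff, d_model)))
    for _ in range(num_experts)
]                                                         # per-expert FFN weights

def moe_layer(x):
    """x: (d_model,) activation for a single token."""
    logits = x @ router_w                                 # score every expert (cheap)
    chosen = np.argsort(logits)[-top_k:]                  # keep only the top-k experts
    weights = np.exp(logits[chosen]) / np.exp(logits[chosen]).sum()  # softmax over chosen
    out = np.zeros(d_model)
    for w, idx in zip(weights, chosen):
        w_in, w_out = experts[idx]                        # only these weights are read
        out += w * (np.maximum(x @ w_in, 0.0) @ w_out)
    return out

token = rng.standard_normal(d_model)
print(moe_layer(token).shape)                             # (64,)
print(f"expert weights touched per token: {top_k}/{num_experts}")
```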

Local assistants & smart home visions

  • Strong interest in a privacy‑preserving, fully local “Alexa replacement”:
    • Cheap room devices (mic/speaker, maybe display) + a home server + pluggable inference boxes.
    • Desire for plug‑and‑play standards, no cloud accounts, and voice control over timers, weather, basic queries, and home automation.
  • Home Assistant + its Voice edition mentioned as close to this vision; others point to text‑based assistants and DIY agents.
  • Some want proactive, context‑aware local agents that listen to household conversations, identify problems, and later propose solutions—seen by some as exciting, by others as creepy.

HN summaries & AI in the comment stream

  • One user wants an HN front page with automatic LLM article summaries; others strongly oppose AI‑generated “slop” in comments.
  • Compromise suggestions: browser extensions/userscripts that call LLMs client‑side so HN itself stays human‑written.
  • Concern raised that over‑reliance on summaries may harm media literacy and flatten authors’ style and nuance.

Hardware constraints & inference accelerators

  • Several argue for ubiquitous, cheap inference chips/NPUs on all boards to offload LLM compute.
  • Others counter that memory capacity and bandwidth, not raw FLOPs, are the main bottlenecks for LLM inference; adding more compute units alone doesn’t fix that (a back‑of‑envelope calculation follows this list).
  • NPUs already exist in many consumer devices (e.g. “Copilot+” PCs), but software support and real‑world use remain limited.
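
A back‑of‑envelope calculation of the bandwidth argument; the Pi 5 bandwidth figure (~17 GB/s theoretical for a 32‑bit LPDDR4X‑4267 interface) and the “every active weight is read once per token” assumption are rough approximations:

```python
# Back-of-envelope: generation speed is roughly memory bandwidth divided by
# the bytes of weights read per token. All numbers below are rough assumptions.
active_params = 3e9    # ~3B active parameters per token (A3B MoE)
total_params = 30e9    # ~30B total parameters
bpw = 2.7              # average bits per weight for this Q3_K_S quant

bytes_per_token_moe = active_params * bpw / 8
bytes_per_token_dense = total_params * bpw / 8

pi5_bandwidth = 17e9   # ~17 GB/s theoretical LPDDR4X-4267 bandwidth (assumed)

print(f"MoE:   ~{bytes_per_token_moe / 1e9:.1f} GB/token -> "
      f"ceiling ~{pi5_bandwidth / bytes_per_token_moe:.0f} tok/s")
print(f"Dense: ~{bytes_per_token_dense / 1e9:.1f} GB/token -> "
      f"ceiling ~{pi5_bandwidth / bytes_per_token_dense:.1f} tok/s")
# The MoE ceiling lands in the mid-teens, consistent with the ~8 tok/s users
# report once real-world overheads are included; a dense 30B would be ~10x slower.
```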

Model choice & usefulness of small models

  • Suggestions for exploring models via OpenRouter‑like services: cheap per‑token access to many models makes side‑by‑side comparison easy (see the comparison sketch after this list).
  • 30B‑class quantized models (e.g. Qwen3‑30B A3B) seen as a current sweet spot: not frontier‑level but often better than GPT‑4o and usable as basic coding agents when enough VRAM/RAM is available.
  • Smaller models (4–8B) recommended for tasks like translation or summarization on modest hardware.
  • Experiences with ultra‑small models (0.6–1B) are mixed: some find them surprisingly capable for narrow, structured tasks and “natural language sheen”; others say they’re effectively useless for anything that matters.
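
For the OpenRouter‑style comparison above, a minimal sketch using the OpenAI‑compatible Python client; the model slugs are assumptions and may not match the service’s current catalogue:

```python
# Minimal sketch: compare a few model sizes on the same prompt via an
# OpenAI-compatible endpoint (OpenRouter-style). Model slugs are assumptions.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_API_KEY",      # placeholder
)

models = [
    "qwen/qwen3-30b-a3b",        # assumed slug for the 30B A3B model
    "qwen/qwen3-8b",             # assumed slug for a small dense model
]

prompt = "Summarize the trade-offs of 3-bit quantization in three bullet points."

for model in models:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=200,
    )
    print(f"--- {model} ---")
    print(resp.choices[0].message.content)
```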

Edge / local inference use cases

  • Users discuss which tasks fit “slow but private” local inference:
    • Smart‑home analysis (Home Assistant logs, sensor data), anomaly or “interesting pattern” detection.
    • Long‑running, non‑critical background agents that don’t require perfect accuracy (one possible shape is sketched after this list).
  • A few note that while edge demos are impressive, practical deployment faces messy constraints: thermals, power draw, RAM limits, and unclear real‑world need versus just using the cloud.
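
One possible shape for the “slow but private” background agent idea, sketched with llama-cpp-python; the log path, polling interval, and prompt are placeholders rather than a real Home Assistant integration:

```python
# Hedged sketch of a non-critical background agent: periodically feed recent
# smart-home log lines to a local model and print anything it flags as unusual.
# The log path, interval, and prompt are placeholders, not a real integration.
import time
from pathlib import Path
from llama_cpp import Llama

LOG_PATH = Path("/var/log/home-assistant.log")  # hypothetical log location
llm = Llama(model_path="Qwen3-30B-A3B-Q3_K_S.gguf", n_ctx=4096)

def recent_lines(n=30):
    lines = LOG_PATH.read_text().splitlines()
    return "\n".join(lines[-n:])

while True:
    report = llm(
        "You are a home monitoring assistant. Flag anything unusual in these "
        "log lines, or reply 'nothing unusual':\n" + recent_lines(),
        max_tokens=256,
    )["choices"][0]["text"]
    print(report)      # a real agent might send a notification instead
    time.sleep(600)    # a slow cadence is fine for a non-critical agent
```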