A 30B Qwen model walks into a Raspberry Pi and runs in real time
Performance & “real time” on Raspberry Pi
- OP’s claim clarified: on a Pi 5 (16GB), Qwen3‑30B‑A3B Q3_K_S (2.7 bpw) reaches ~8 tokens/sec while preserving ~94% of BF16 benchmark accuracy.
- Several users reproduced the results: one needed a shorter context (-c 4096) to avoid OOM, then saw ~6–8 tok/s generation and ~8–10 tok/s prompt processing; longer outputs dropped to ~4–6 tok/s (see the launch sketch after this list).
- Another reported only ~3–4 tok/s with the same model, for unclear reasons. Others benchmarked similar speeds on low‑end x86 mini‑PCs and better speeds (15–40 tok/s) on desktop CPUs/GPUs.
- Some note the reduced context (e.g. 4096) is a real limitation and can degrade answer quality for longer interactions.
- Debate on “real time”: some use it loosely (“as fast as you can read”), others insist real‑time means bounded latency (e.g. ~10 tok/s including TTS) or sub‑30ms reactions.
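For anyone wanting to try the reduced-context setup mentioned above, a minimal sketch using the llama-cpp-python bindings; the GGUF filename and thread count are assumptions, and n_ctx=4096 corresponds to llama.cpp's -c 4096 flag:

```python
from llama_cpp import Llama

# Minimal sketch of the reduced-context setup discussed above.
# The model filename is a placeholder; n_ctx=4096 mirrors llama.cpp's -c 4096.
llm = Llama(
    model_path="Qwen3-30B-A3B-Q3_K_S.gguf",  # assumed local GGUF file
    n_ctx=4096,     # shorter context to stay within 16 GB of RAM
    n_threads=4,    # Pi 5 has four Cortex-A76 cores
)

out = llm("List three uses for a fully local voice assistant.", max_tokens=128)
print(out["choices"][0]["text"])
```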
Quantization, accuracy metrics & A3B MoE
- Quantization is per‑tensor and variable‑bit (hence the “average bpw” figure), with surprisingly high benchmark retention.
- “Accuracy” here means combined scores on GSM8K, MMLU, IFEval, and LiveCodeBench; the vendor’s methodology is linked.
- Some feel calling it “30B” is misleading because it’s an A3B MoE model: only ~3B parameters are active per token, so memory bandwidth per token is closer to a 3–5B dense model, though total parameters are ~30B.
- Explanations of MoE: a small “router” picks a subset of experts per token, so most weights are never fetched on any given step (see the sketch after this list).
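A toy illustration of that routing step (pure NumPy, made-up sizes, not Qwen’s actual architecture): for each token only the top-k experts’ weights are read, which is why per-token bandwidth looks like a much smaller dense model.

```python
import numpy as np

def moe_forward(x, router_w, experts, top_k=2):
    """Toy mixture-of-experts step: route one token's hidden state
    to its top-k experts and mix their outputs by router weight."""
    logits = x @ router_w                      # one score per expert
    top = np.argsort(logits)[-top_k:]          # indices of the chosen experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                   # softmax over the chosen few
    # Only the selected experts' weights are ever touched for this token.
    return sum(w * experts[i](x) for w, i in zip(weights, top))

# Toy setup: 8 experts, hidden size 16; each expert is a small linear map.
rng = np.random.default_rng(0)
hidden = 16
experts = [lambda x, W=rng.normal(size=(hidden, hidden)): x @ W for _ in range(8)]
router_w = rng.normal(size=(hidden, 8))
print(moe_forward(rng.normal(size=hidden), router_w, experts).shape)  # (16,)
```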
Local assistants & smart home visions
- Strong interest in a privacy‑preserving, fully local “Alexa replacement”:
  - Cheap room devices (mic/speaker, maybe display) + a home server + pluggable inference boxes.
  - Desire for plug‑and‑play standards, no cloud accounts, and voice control over timers, weather, basic queries, and home automation.
- Home Assistant + its Voice edition mentioned as close to this vision; others point to text‑based assistants and DIY agents.
- Some want proactive, context‑aware local agents that listen to household conversations, identify problems, and later propose solutions—seen by some as exciting, by others as creepy.
HN summaries & AI in the comment stream
- One user wants an HN front page with automatic LLM article summaries; others strongly oppose AI‑generated “slop” in comments.
- Compromise suggestions: browser extensions/userscripts that call LLMs client‑side so HN itself stays human‑written.
- Concern raised that over‑reliance on summaries may harm media literacy and flatten authors’ style and nuance.
Hardware constraints & inference accelerators
- Several argue for ubiquitous, cheap inference chips/NPUs on all boards to offload LLM compute.
- Others counter that memory capacity and bandwidth, not raw FLOPs, are the main bottlenecks for LLMs; just adding compute units doesn’t fix that (see the back‑of‑envelope sketch after this list).
- NPUs already exist in many consumer devices (e.g. “Copilot+” PCs), but software support and real‑world use remain limited.
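The bandwidth argument can be made concrete with a rough calculation; all figures below are illustrative assumptions (≈3B active parameters, the post’s ~2.7 bpw, and a range of effective memory bandwidths), not measurements:

```python
# Back-of-envelope: memory-bandwidth-bound decode speed.
# All figures below are illustrative assumptions, not measurements.
active_params = 3e9          # ~3B parameters touched per token (A3B MoE)
bits_per_weight = 2.7        # Q3_K_S average bpw from the post
bytes_per_token = active_params * bits_per_weight / 8   # ~1.0 GB read per token

for bandwidth_gbps in (8, 17, 50, 400):   # effective GB/s: Pi-class .. GPU-class
    tok_s = bandwidth_gbps * 1e9 / bytes_per_token
    print(f"{bandwidth_gbps:>4} GB/s -> ~{tok_s:.0f} tok/s upper bound")
```

At an effective ~8 GB/s this gives roughly 8 tok/s, which lines up with the Pi 5 numbers reported in the thread.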
Model choice & usefulness of small models
- Suggestions for exploring models via OpenRouter‑like services: cheap per‑token access to many models for side‑by‑side comparison (see the sketch after this list).
- 30B‑class quantized models (e.g. Qwen3‑30B‑A3B) seen as a current sweet spot: not frontier‑level but often better than GPT‑4o and usable as basic coding agents when enough VRAM/RAM is available.
- Smaller models (4–8B) recommended for tasks like translation or summarization on modest hardware.
- Experiences with ultra‑small models (0.6–1B) are mixed: some find them surprisingly capable for narrow, structured tasks and “natural language sheen”; others say they’re effectively useless for anything that matters.
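As a sketch of the OpenRouter‑style comparison workflow (OpenRouter exposes an OpenAI‑compatible endpoint; the model slugs below are illustrative and should be checked against the provider’s catalog):

```python
import os
from openai import OpenAI

# OpenRouter speaks the OpenAI chat-completions protocol; the model slugs
# below are examples, not a recommendation.
client = OpenAI(base_url="https://openrouter.ai/api/v1",
                api_key=os.environ["OPENROUTER_API_KEY"])

prompt = "Summarize the trade-offs of running a quantized 30B MoE model on a Raspberry Pi."
for model in ("qwen/qwen3-30b-a3b", "meta-llama/llama-3.1-8b-instruct"):
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=200,
    )
    print(f"--- {model} ---\n{resp.choices[0].message.content}\n")
```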
Edge / local inference use cases
- Users discuss which tasks fit “slow but private” local inference:
  - Smart‑home analysis (Home Assistant logs, sensor data), anomaly or “interesting pattern” detection.
  - Long‑running, non‑critical background agents that don’t require perfect accuracy.
- A few note that while edge demos are impressive, practical deployment faces messy constraints: thermals, power draw, RAM limits, and unclear real‑world need versus just using the cloud.