A 30B Qwen model walks into a Raspberry Pi and runs in real time

Performance & “real time” on Raspberry Pi

  • OP’s claim clarified: on a Pi 5 (16GB), Qwen3‑30B‑A3B Q3_K_S (2.7 bpw) reaches ~8 tokens/sec while preserving ~94% of BF16 benchmark accuracy.
  • Several users reproduced the results: one needed -c 4096 (a shorter context window) to avoid OOM, then saw ~6–8 tok/s generation and ~8–10 tok/s prompt processing; longer outputs dropped to ~4–6 tok/s (a minimal reproduction sketch follows this list).
  • Another reported only ~3–4 tok/s with the same model, for unclear reasons. Others benchmarked similar speeds on low‑end x86 mini‑PCs and better speeds (15–40 tok/s) on desktop CPUs/GPUs.
  • Some note the reduced context (e.g. 4096) is a real limitation and can degrade answer quality for longer interactions.
  • Debate on “real time”: some use it loosely (“as fast as you can read”), while others insist real‑time means bounded latency (e.g. ~10 tok/s including TTS) or sub‑30 ms reactions.
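
A minimal reproduction sketch using the llama-cpp-python bindings, where n_ctx plays the role of llama.cpp’s -c flag; the GGUF filename and thread count below are illustrative assumptions rather than values from the thread:

```python
# Minimal sketch: run a quantized Qwen3-30B-A3B GGUF with a reduced context
# window to stay inside a Pi 5's 16 GB of RAM. The model filename and thread
# count are assumptions for illustration, not values from the thread.
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen3-30B-A3B-Q3_K_S.gguf",  # hypothetical local GGUF path
    n_ctx=4096,    # equivalent of llama.cpp's `-c 4096`; avoids OOM at the cost of context
    n_threads=4,   # the Pi 5 has four Cortex-A76 cores
)

out = llm(
    "Explain what a mixture-of-experts model is in two sentences.",
    max_tokens=128,
)
print(out["choices"][0]["text"])
```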

Quantization, accuracy metrics & A3B MoE

  • Quantization discussed as per‑tensor and variable‑bit (hence the quoted “average bpw”), with surprisingly high benchmark retention.
  • “Accuracy” here means combined scores on GSM8K, MMLU, IFEval, and LiveCodeBench; a link to the vendor’s methodology is referenced.
  • Some feel calling it “30B” is misleading because it’s an A3B MoE model: only ~3B parameters are active per token, so the memory traffic per token is closer to that of a 3–5B dense model, even though total parameters are ~30B.
  • Explanations of MoE: a small “router” picks a subset of experts for each token, so most of the weights are never fetched on a given step (see the toy routing sketch below).
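
A toy numpy sketch of that routing idea; the layer sizes and expert counts are made up for illustration and do not reflect Qwen3‑30B‑A3B’s real configuration:

```python
# Toy mixture-of-experts routing in numpy. Sizes are illustrative only.
import numpy as np

d_model, d_ff = 64, 256
num_experts, top_k = 8, 2

rng = np.random.default_rng(0)
router_w = rng.standard_normal((d_model, num_experts))   # router projection
experts = [
    (rng.standard_normal((d_model, d_ff)), rng.standard_normal((d_ff, d_model)))
    for _ in range(num_experts)
]                                                         # per-expert FFN weights

def moe_layer(x):
    """x: (d_model,) activation for a single token."""
    logits = x @ router_w                                 # score every expert (cheap)
    chosen = np.argsort(logits)[-top_k:]                  # keep only the top-k experts
    weights = np.exp(logits[chosen]) / np.exp(logits[chosen]).sum()  # softmax over chosen
    out = np.zeros(d_model)
    for w, idx in zip(weights, chosen):
        w_in, w_out = experts[idx]                        # only these weights are read
        out += w * (np.maximum(x @ w_in, 0.0) @ w_out)
    return out

token = rng.standard_normal(d_model)
print(moe_layer(token).shape)                             # (64,)
print(f"expert weights touched per token: {top_k}/{num_experts}")
```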

Local assistants & smart home visions

  • Strong interest in a privacy‑preserving, fully local “Alexa replacement”:
    • Cheap room devices (mic/speaker, maybe display) + a home server + pluggable inference boxes.
    • Desire for plug‑and‑play standards, no cloud accounts, and voice control over timers, weather, basic queries, and home automation.
  • Home Assistant + its Voice edition mentioned as close to this vision; others point to text‑based assistants and DIY agents.
  • Some want proactive, context‑aware local agents that listen to household conversations, identify problems, and later propose solutions—seen by some as exciting, by others as creepy.

HN summaries & AI in the comment stream

  • One user wants an HN front page with automatic LLM article summaries; others strongly oppose AI‑generated “slop” in comments.
  • Compromise suggestions: browser extensions/userscripts that call LLMs client‑side so HN itself stays human‑written.
  • Concern raised that over‑reliance on summaries may harm media literacy and flatten authors’ style and nuance.

Hardware constraints & inference accelerators

  • Several argue for ubiquitous, cheap inference chips/NPUs on all boards to offload LLM compute.
  • Others counter that memory capacity and bandwidth, not raw FLOPs, are the main bottlenecks for LLM inference; adding more compute units alone doesn’t fix that (a back‑of‑envelope calculation follows this list).
  • NPUs already exist in many consumer devices (e.g. “Copilot+” PCs), but software support and real‑world use remain limited.
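
A back‑of‑envelope calculation of the bandwidth argument; the Pi 5 bandwidth figure (~17 GB/s theoretical for a 32‑bit LPDDR4X‑4267 interface) and the “every active weight is read once per token” assumption are rough approximations:

```python
# Back-of-envelope: generation speed is roughly memory bandwidth divided by
# the bytes of weights read per token. All numbers below are rough assumptions.
active_params = 3e9    # ~3B active parameters per token (A3B MoE)
total_params = 30e9    # ~30B total parameters
bpw = 2.7              # average bits per weight for this Q3_K_S quant

bytes_per_token_moe = active_params * bpw / 8
bytes_per_token_dense = total_params * bpw / 8

pi5_bandwidth = 17e9   # ~17 GB/s theoretical LPDDR4X-4267 bandwidth (assumed)

print(f"MoE:   ~{bytes_per_token_moe / 1e9:.1f} GB/token -> "
      f"ceiling ~{pi5_bandwidth / bytes_per_token_moe:.0f} tok/s")
print(f"Dense: ~{bytes_per_token_dense / 1e9:.1f} GB/token -> "
      f"ceiling ~{pi5_bandwidth / bytes_per_token_dense:.1f} tok/s")
# The MoE ceiling lands in the mid-teens, consistent with the ~8 tok/s users
# report once real-world overheads are included; a dense 30B would be ~10x slower.
```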

Model choice & usefulness of small models

  • Suggestions for exploring models via OpenRouter‑like services: cheap per‑token access to many models makes side‑by‑side comparison easy (see the comparison sketch after this list).
  • 30B‑class quantized models (e.g. Qwen3‑30B A3B) seen as a current sweet spot: not frontier‑level but often better than GPT‑4o and usable as basic coding agents when enough VRAM/RAM is available.
  • Smaller models (4–8B) recommended for tasks like translation or summarization on modest hardware.
  • Experiences with ultra‑small models (0.6–1B) are mixed: some find them surprisingly capable for narrow, structured tasks and “natural language sheen”; others say they’re effectively useless for anything that matters.
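
For the OpenRouter‑style comparison above, a minimal sketch using the OpenAI‑compatible Python client; the model slugs are assumptions and may not match the service’s current catalogue:

```python
# Minimal sketch: compare a few model sizes on the same prompt via an
# OpenAI-compatible endpoint (OpenRouter-style). Model slugs are assumptions.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_API_KEY",      # placeholder
)

models = [
    "qwen/qwen3-30b-a3b",        # assumed slug for the 30B A3B model
    "qwen/qwen3-8b",             # assumed slug for a small dense model
]

prompt = "Summarize the trade-offs of 3-bit quantization in three bullet points."

for model in models:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=200,
    )
    print(f"--- {model} ---")
    print(resp.choices[0].message.content)
```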

Edge / local inference use cases

  • Users discuss which tasks fit “slow but private” local inference:
    • Smart‑home analysis (Home Assistant logs, sensor data), anomaly or “interesting pattern” detection.
    • Long‑running, non‑critical background agents that don’t require perfect accuracy (one possible shape is sketched after this list).
  • A few note that while edge demos are impressive, practical deployment faces messy constraints: thermals, power draw, RAM limits, and unclear real‑world need versus just using the cloud.
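
One possible shape for the “slow but private” background agent idea, sketched with llama-cpp-python; the log path, polling interval, and prompt are placeholders rather than a real Home Assistant integration:

```python
# Hedged sketch of a non-critical background agent: periodically feed recent
# smart-home log lines to a local model and print anything it flags as unusual.
# The log path, interval, and prompt are placeholders, not a real integration.
import time
from pathlib import Path
from llama_cpp import Llama

LOG_PATH = Path("/var/log/home-assistant.log")  # hypothetical log location
llm = Llama(model_path="Qwen3-30B-A3B-Q3_K_S.gguf", n_ctx=4096)

def recent_lines(n=30):
    lines = LOG_PATH.read_text().splitlines()
    return "\n".join(lines[-n:])

while True:
    report = llm(
        "You are a home monitoring assistant. Flag anything unusual in these "
        "log lines, or reply 'nothing unusual':\n" + recent_lines(),
        max_tokens=256,
    )["choices"][0]["text"]
    print(report)      # a real agent might send a notification instead
    time.sleep(600)    # a slow cadence is fine for a non-critical agent
```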