Accelerating Gemma 4: faster inference with multi-token prediction drafters
Model quality and comparisons
- Many find Gemma 4 31B “fantastic” and Gemma 4 26B-A4B very fast and strong for general queries, though several say Qwen 3.5/3.6 remains better for coding and tool-calling.
- Some report Gemma 4 26B making more mistakes than Qwen and Gemma 3, and say 31B is more accurate but “horrendously” slow without MTP.
- Several note Gemma/Gemini often use far fewer tokens and finish tasks much faster than Qwen, which tends to produce long reasoning chains; the trade-off is framed as roughly 5–10% lower quality in exchange for much shorter run times.
- There’s disagreement on “which is better”: some strongly prefer Qwen overall, others prefer Gemma’s speed, prose, and vision, and suggest routing tasks across models.
Speculative decoding / Multi-Token Prediction (MTP)
- MTP is described as a refined form of speculative decoding: a tiny “assistant” model proposes several future tokens, and the large model verifies them in parallel.
- Google’s Gemma assistants are extremely small (e.g., ~78M params for E4B) and reuse the main model’s activations and KV cache, reducing overhead versus earlier MTP setups.
- Users report large local throughput gains (often 2–2.5× tokens per second) with little or no observed quality loss, though some argue acceptance rules can subtly change output distributions.
- Conceptual analogies include branch prediction and batching against your “own speculated future path”; the approach works best when inference is memory-bandwidth-bound and spare compute is available.
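The propose-then-verify loop described above can be sketched with toy stand-ins for the two models. This is a minimal illustration under stated assumptions, not Gemma's or any library's implementation: `toy_target`, `toy_draft`, `speculative_step`, and `generate` are hypothetical names, both "models" are plain functions over integer tokens, and acceptance is greedy (a drafted token is kept only if the large model would have picked the same token). Real systems verify all drafted positions in one batched forward pass, and sampling-based acceptance rules are more involved.

```python
from typing import Callable, List, Tuple

# A "model" here is just a greedy next-token function over integer tokens.
Model = Callable[[List[int]], int]


def toy_target(ctx: List[int]) -> int:
    """Hypothetical large model: a fixed deterministic rule."""
    return (3 * ctx[-1] + 1) % 17


def toy_draft(ctx: List[int]) -> int:
    """Hypothetical small drafter: agrees with the target most of the time."""
    guess = toy_target(ctx)
    return (guess + 1) % 17 if ctx[-1] % 5 == 0 else guess


def speculative_step(target: Model, draft: Model,
                     context: List[int], k: int) -> List[int]:
    """Draft k tokens, then keep the longest prefix the target agrees with."""
    proposed: List[int] = []
    ctx = list(context)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # Verification. A real implementation scores all positions in one
    # batched pass of the large model; here we call it position by position.
    accepted: List[int] = []
    ctx = list(context)
    for tok in proposed:
        expected = target(ctx)
        if expected != tok:
            accepted.append(expected)  # replace the first wrong guess
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target(ctx))  # all guesses accepted: one bonus token
    return accepted


def generate(target: Model, draft: Model, prompt: List[int],
             n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Generate n_new tokens; also return the number of verification steps."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < n_new:
        out.extend(speculative_step(target, draft, out, k))
        steps += 1
    return out[:len(prompt) + n_new], steps


def greedy(target: Model, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out
```

With greedy acceptance, the output matches plain greedy decoding from the large model token for token; the only difference is how many verification steps are needed. That is the sense in which quality is preserved in this sketch, and it does not settle the thread's question about sampling-based acceptance rules.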
Tooling, implementations, and speed reports
- MTP support is being added to llama.cpp, vLLM, Ollama, MLX, etc. Early patches show big speedups for Qwen and Gemma on consumer and older data-center GPUs.
- Using Gemma 4 with MTP currently requires paired “-assistant” models or combined formats (e.g., GGUF with both heads); some tools (LM Studio, oMLX) lag or have quirks.
- Reports include dense 26B/31B Gemma and Qwen 27B jumping from ~20 t/s to 40–55 t/s or more; occasionally prefill slows while decode speeds soar.
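For a rough sense of where speedups of this size can come from, the sketch below uses a common simplified model of speculative decoding. It is an assumption, not something stated in the thread: each drafted token is accepted independently with probability `accept_rate`, one verification pass of the large model costs about the same as one ordinary decode step (the memory-bandwidth-bound case), and each draft token costs `draft_cost_ratio` of a large-model step. All names and numbers are illustrative.

```python
def expected_tokens_per_pass(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per large-model verification pass.

    Assumes each of the draft_len drafted tokens is accepted independently
    with probability accept_rate, and that the pass always yields at least
    one token (the correction or bonus token).
    """
    if accept_rate >= 1.0:
        return float(draft_len + 1)
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


def estimated_speedup(accept_rate: float, draft_len: int,
                      draft_cost_ratio: float) -> float:
    """Estimated decode speedup over plain one-token-per-pass decoding.

    draft_cost_ratio is the assumed cost of one draft token relative to
    one large-model pass (small for a tiny assistant model).
    """
    cost_per_step = 1.0 + draft_len * draft_cost_ratio
    return expected_tokens_per_pass(accept_rate, draft_len) / cost_per_step
```

Under these assumptions, `estimated_speedup(0.8, 4, 0.05)` comes out at about 2.8×. The inputs are made-up illustrative values rather than measurements of Gemma or Qwen. The model also ignores prefill, which some reports say can get slower.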
Local vs cloud, product strategy, and UX
- Strong enthusiasm for local models: a decent 20–50+ t/s on sub-$1k GPUs, with offline tools highlighted as a feature (privacy, zero tracking).
- Some confusion and frustration over Google’s product maze (Gemini, Vertex, AI Studio, Edge) and difficulty discovering how to pay for Gemma 4 via existing accounts.
- Mixed reviews of Gemini pricing, quotas, and CLI/dev tools: some say $15–20/month plans enable “all-day” coding; others hit limits fast and perceive quality/UX regressions.
- Several frame Google’s broader strategy as prioritizing efficiency and massive-scale deployment (Flash, Gemma) over pushing the largest possible frontier models.