2026-05-05

Accelerating Gemma 4: faster inference with multi-token prediction drafters

Model quality and comparisons

Many find Gemma 4 31B “fantastic” and Gemma 4 26B-A4B very fast and strong for general queries, though several say Qwen 3.5/3.6 remains better for coding and tool-calling.
Some report Gemma 4 26B making more mistakes than Qwen and Gemma 3, and say 31B is more accurate but “horrendously” slow without MTP.
Several note that Gemma and Gemini often use far fewer tokens and finish tasks much faster than Qwen, which tends to produce long reasoning chains; the trade-off is framed as roughly 5–10% lower quality in exchange for much less time.
There’s disagreement on “which is better”: some strongly prefer Qwen overall, others prefer Gemma’s speed, prose, and vision, and suggest routing tasks across models.

Speculative decoding / Multi-Token Prediction (MTP)

MTP is described as a refined form of speculative decoding: a tiny “assistant” model proposes several future tokens, and the large model verifies them in parallel.
Google’s Gemma assistants are extremely small (e.g., ~78M params for E4B) and reuse the main model’s activations and KV cache, reducing overhead versus earlier MTP setups.
Users report large throughput gains locally (often 2–2.5× TPS) with little or no observed quality loss, though some argue acceptance rules can subtly change output distributions.
Conceptual analogies include branch prediction and batching against your “own speculated future path”; the technique works best when inference is memory-bandwidth-bound and spare compute is available.
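
The draft-then-verify loop described above can be sketched with toy stand-ins. This is a minimal illustration under stated assumptions, not Gemma's or any engine's implementation: `target_next` and `draft_next` are hypothetical deterministic functions standing in for the large model and the assistant, decoding is greedy, and the per-position loop in step 2 stands in for what a real engine would do in one batched forward pass. With sampling instead of greedy decoding, implementations use a probabilistic acceptance rule, which is not shown here.

```python
from typing import Callable, List, Tuple

Token = int
NextTokenFn = Callable[[List[Token]], Token]  # greedy next-token function


def target_next(context: List[Token]) -> Token:
    """Hypothetical stand-in for the large model: a deterministic rule."""
    return (sum(context) * 31 + len(context)) % 50


def draft_next(context: List[Token]) -> Token:
    """Hypothetical stand-in for the small drafter: usually agrees with the target."""
    guess = target_next(context)
    # Deliberately wrong at some positions so that rejections occur.
    return guess if len(context) % 4 != 0 else (guess + 1) % 50


def baseline_generate(prompt: List[Token], n_new: int, target: NextTokenFn) -> List[Token]:
    """Ordinary decoding: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target(tokens))
    return tokens


def speculative_generate(
    prompt: List[Token], n_new: int, k: int, target: NextTokenFn, draft: NextTokenFn
) -> Tuple[List[Token], int]:
    """Greedy speculative decoding; returns the tokens and the number of verification passes."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < n_new:
        # 1. The drafter proposes k tokens autoregressively (cheap).
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. The target scores all k+1 positions (one batched pass in a real engine).
        target_passes += 1
        verified = [target(tokens + proposal[:i]) for i in range(k + 1)]
        # 3. Accept the longest prefix on which drafter and target agree.
        n_ok = 0
        while n_ok < k and proposal[n_ok] == verified[n_ok]:
            n_ok += 1
        # 4. Keep the accepted tokens plus the target's own token at the first
        #    mismatch (or a bonus token if every draft was accepted).
        tokens.extend(proposal[:n_ok])
        tokens.append(verified[n_ok])
    return tokens[: len(prompt) + n_new], target_passes


if __name__ == "__main__":
    prompt = [1, 2, 3]
    out, passes = speculative_generate(prompt, 20, 4, target_next, draft_next)
    print("identical to baseline:", out == baseline_generate(prompt, 20, target_next))
    print("verification passes for 20 tokens:", passes)
```

Under greedy decoding every kept token is one the target itself would have chosen, so the output matches ordinary decoding while needing fewer passes through the large model.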

Tooling, implementations, and speed reports

MTP support is being added to llama.cpp, vLLM, Ollama, MLX, etc. Early patches show big speedups for Qwen and Gemma on consumer and older data-center GPUs.
Using Gemma 4 with MTP currently requires paired “-assistant” models or combined formats (e.g., GGUF with both heads); some tools (LM Studio, oMLX) lag or have quirks.
Reports include dense 26B/31B Gemma and Qwen 27B jumping from ~20 t/s to 40–55 t/s or more; in some cases prefill slows down even as decode speed rises sharply.
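
As a rough sanity check on figures like these (20 t/s to 40–55 t/s is about 2–2.75×), the standard back-of-the-envelope estimate for speculative decoding can be computed directly. This is a sketch under simplifying assumptions: each drafted token is accepted independently with probability `alpha`, a verification pass costs about the same as one ordinary decode step (plausible when decoding is memory-bandwidth-bound), and `draft_cost_ratio` is the drafter's cost per token relative to that step. The numbers used are illustrative only, not measurements of Gemma or Qwen.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens produced per verification pass with k drafted tokens.

    Assumes each drafted token is accepted independently with probability alpha.
    """
    if alpha >= 1.0:
        return float(k + 1)  # every draft accepted, plus the bonus token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost_ratio: float) -> float:
    """Estimated decode speedup over ordinary one-token-per-pass decoding."""
    cost_per_step = 1.0 + k * draft_cost_ratio  # one target pass plus k drafter steps
    return expected_tokens_per_pass(alpha, k) / cost_per_step


if __name__ == "__main__":
    # Illustrative values: 80% acceptance, 4 drafted tokens, drafter at 5% of target cost.
    print(round(expected_tokens_per_pass(0.8, 4), 4))
    print(round(estimated_speedup(0.8, 4, 0.05), 2))
```

The estimate covers decode only; it says nothing about prefill, which some reports describe as slower with MTP enabled.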

Local vs cloud, product strategy, and UX

Strong enthusiasm for local models: decent 20–50+ t/s on sub-$1k GPUs; offline tools highlighted as a feature (privacy, zero tracking).
Some confusion and frustration over Google’s product maze (Gemini, Vertex, AI Studio, Edge) and difficulty discovering how to pay for Gemma 4 via existing accounts.
Mixed reviews of Gemini pricing, quotas, and CLI/dev tools: some say $15–20/month plans enable “all-day” coding; others hit limits fast and perceive quality/UX regressions.
Several frame Google’s broader strategy as prioritizing efficiency and massive-scale deployment (Flash, Gemma) over pushing the largest possible frontier models.
