2026-06-05

Gemma 4 QAT models: Optimizing compression for mobile and laptop efficiency

Release cadence, naming, and confusion

Several Gemma 4 releases within a few weeks (base, assistant/MTP, 12B, QAT variants) create implementation churn for downstream apps.
Some find staggered releases “awkward” and confusing; others argue models are research outputs, so scientific naming and incremental releases are appropriate.
The “E2B/E4B/A4B” naming and unclear memory requirements on Android are described as particularly opaque.

QAT vs standard quantization

Q4_0 “QAT” checkpoints are not just post-hoc 4‑bit quantizations; they are BF16 models trained to be robust to later 4‑bit quantization.
Unsloth provides 4–8 bit GGUFs of these QAT models and claims better results than naïve truncation of Google’s checkpoints.
There is confusion over why Google’s QAT Q4_0 and Unsloth’s Q4_0 differ if the conversion is “just packing”; “lattice alignment” is mentioned as one factor.
One commenter notes that QAT targets weight precision only; it does not fix Gemma’s very large activations, which force a high-precision KV cache and limit context length on local hardware.
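
To make the QAT vs post-hoc distinction concrete, here is a minimal sketch of the weight round-trip that 4-bit block quantization applies. It assumes a simplified symmetric scheme (32 weights per block, one scale per block, integer levels -8..7); this is not the exact GGUF Q4_0 encoding and not Google's training recipe, and the function name is made up for illustration. In QAT generally, a round-trip like this runs in the forward pass during training so the weights learn to tolerate the rounding; post-hoc quantization applies it once to a finished model.

```python
import numpy as np

BLOCK = 32  # weights per block (assumption, mirrors the Q4_0 block size)

def quantize_dequantize_4bit(w: np.ndarray) -> np.ndarray:
    """Round-trip weights through a simplified symmetric 4-bit block format."""
    blocks = w.reshape(-1, BLOCK).astype(np.float32)
    # One scale per block, chosen so the largest magnitude lands on level 7.
    scale = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scale[scale == 0] = 1.0  # all-zero blocks: any scale works, avoid 0/0
    q = np.clip(np.round(blocks / scale), -8, 7)  # 16 integer levels
    return (q * scale).reshape(w.shape)

rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 64)).astype(np.float32)
restored = quantize_dequantize_4bit(weights)

# Rounding error per weight is at most half a quantization step.
max_err = float(np.abs(weights - restored).max())
max_step = float(np.abs(weights).max() / 7.0)
print(f"max error {max_err:.4f}, half of largest step {max_step / 2:.4f}")
```

Nothing here touches activations or the KV cache, which is the commenter's point: the round-trip only constrains stored weights.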

Performance and use cases of small models

Some report good results with 2B/4B QAT models for web search orchestration, structured JSON, and simple recommendation/recipe-style tasks.
Others find E2B/E4B “too dumb” for factual or classification tasks (e.g., getting basic political facts wrong) and see them as niche without strong tool grounding.
Mixed experiences with tool use: some see looped or failed web calls in Google’s Edge Gallery; others report robust tool calling via third-party orchestration with careful prompting and retries.
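
The "careful prompting and retries" approach can be sketched as a small validation loop. This is an illustration only: `generate` stands in for whatever local model call an orchestrator uses, and the function name, prompt wording, and key names are all hypothetical rather than taken from any tool mentioned in the thread.

```python
import json
from typing import Callable, Optional

def call_tool_with_retries(
    generate: Callable[[str], str],  # hypothetical model call: prompt in, text out
    prompt: str,
    required_keys: set,
    max_attempts: int = 3,
) -> Optional[dict]:
    """Ask for a JSON tool call; re-prompt with the failure reason if it is unusable."""
    message = prompt
    for _ in range(max_attempts):
        raw = generate(message)
        try:
            call = json.loads(raw)
        except json.JSONDecodeError as exc:
            message = f"{prompt}\n\nYour last reply was not valid JSON ({exc.msg}). Reply with JSON only."
            continue
        if isinstance(call, dict) and required_keys.issubset(call):
            return call
        message = f"{prompt}\n\nYour last reply lacked the keys {sorted(required_keys)}. Reply with JSON only."
    return None  # give up instead of looping forever

# Stand-in model that fails twice before producing a usable call.
replies = iter([
    'search("gemma")',
    '{"tool": "web_search"}',
    '{"tool": "web_search", "query": "gemma 4 qat"}',
])
result = call_tool_with_retries(lambda _p: next(replies), "Find QAT release notes.", {"tool", "query"})
print(result)
```

Capping the attempts is what separates this from the looped web calls some commenters describe: a small model that never produces a valid call ends in `None`, not an endless retry.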

Local inference on consumer devices

QAT 12B Q4_0 at ~6.7GB VRAM is praised for fitting into 8GB GPUs and 16GB laptops; some note Google’s own macOS Edge Gallery incorrectly marks 12B as unsupported.
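
The ~6.7GB figure is roughly what back-of-envelope arithmetic gives. The sketch below assumes exactly 12e9 parameters with every tensor stored as Q4_0 (32 four-bit weights plus one 16-bit scale per block); real files usually keep some tensors at higher precision, and the KV cache and runtime buffers come on top, so treat it as an estimate of weight storage only.

```python
def q4_0_bits_per_weight(block_size: int = 32, quant_bits: int = 4, scale_bits: int = 16) -> float:
    """Average storage per weight: packed 4-bit values plus the shared block scale."""
    return (block_size * quant_bits + scale_bits) / block_size

def weight_gigabytes(n_params: float, bits_per_weight: float) -> float:
    """Weight storage in GB (10^9 bytes), ignoring KV cache and runtime overhead."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 12e9  # assumption: "12B" taken as exactly 12 billion parameters

bpw = q4_0_bits_per_weight()
q4_gb = weight_gigabytes(N_PARAMS, bpw)
bf16_gb = weight_gigabytes(N_PARAMS, 16)
print(f"Q4_0: {bpw} bits/weight, {q4_gb:.2f} GB; BF16: {bf16_gb:.2f} GB")
```

Under these assumptions the weights shrink from 24 GB in BF16 to 6.75 GB, which is why the model fits on an 8GB GPU with a little room left for context.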
Users are running Gemma 4 locally via llama.cpp, Unsloth Studio, LiteRT, LM Studio, MLX, and Ollama on laptops, desktops, and high-end phones.
There is enthusiasm for sub‑1GB text-only variants and on-device multimodal (image/audio) support, though expectations are modest for complex outputs.

MTP (multi-token prediction) and drafters

Google has released specialized “-assistant” MTP draft models (e.g., 26B/31B assistants).
Tooling support is still catching up: llama.cpp MTP support for Gemma 4 is in progress; some formats (safetensors vs GGUF/MLX) require user-side conversion.
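
For readers unfamiliar with drafters, the general draft-and-verify idea can be sketched with toy stand-ins. Assumptions: greedy decoding, a drafter that proposes tokens one at a time (an MTP head would propose several in one pass), and toy functions in place of real models. This is not how llama.cpp or Google's runtimes implement it.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # context -> greedy next token

def speculative_greedy(target: NextToken, draft: NextToken,
                       prompt: List[int], n_new: int, k: int = 4) -> List[int]:
    """Drafter proposes up to k tokens; target keeps the agreeing prefix."""
    out = list(prompt)
    goal = len(prompt) + n_new
    while len(out) < goal:
        proposal: List[int] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # A real runtime scores all proposed positions in one batched target pass.
        for i, tok in enumerate(proposal):
            expected = target(out + proposal[:i])
            if tok != expected:
                out.extend(proposal[:i])  # accepted prefix
                out.append(expected)      # target's correction
                break
        else:
            out.extend(proposal)          # every proposed token accepted
    return out

def plain_greedy(target: NextToken, prompt: List[int], n_new: int) -> List[int]:
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out

# Toy "models": the drafter agrees with the target except at every third position.
toy_target = lambda ctx: (ctx[-1] * 3 + 1) % 7
toy_draft = lambda ctx: toy_target(ctx) if len(ctx) % 3 else 0

fast = speculative_greedy(toy_target, toy_draft, [1], n_new=12)
slow = plain_greedy(toy_target, [1], n_new=12)
print(fast == slow)
```

The output matches target-only greedy decoding by construction; the drafter only affects how many target calls can be batched together, which is where the speedup comes from.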

Cloud vs local, privacy, and cost

Some see no need for local models given reliable access to GPT/Claude, and worry about resource use on personal machines.
Others emphasize:
- Cost savings for automation and pipelines using small models vs frontier cloud models.
- Control, privacy, and running models on airgapped or personal hardware.
Debate arises over whether it’s “ethical” or necessary for cloud providers to train on user data if users are also paying; positions are sharply divided.

Impact on efficiency and carbon footprint

One question asks if local/QAT work translates to greener cloud inference.
Response: consumer and server hardware differ substantially (e.g., TPUs, MoE ratios, caching), and big providers already push hard on optimization; it’s unclear how much public QAT work directly reduces cloud carbon emissions.
