Google releases Gemma 4 open models

Model overview & capabilities

  • Gemma 4 adds “thinking”/reasoning traces, multimodal input (images; audio on E2B/E4B), and tool calling.
  • Lineup: small E2B/E4B models (mobile‑focused, Gemma‑3n‑style architecture), a 26B mixture‑of‑experts model with ~4B active parameters (A4B), and a 31B dense model.
  • Long context (roughly 200k tokens or more on some setups) and strong performance across many public benchmarks; some users say the reasoning feels notably advanced.
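Tool calling in local runtimes such as llama.cpp's server and Ollama typically uses an OpenAI‑style JSON‑schema tool definition. A minimal sketch of building such a request is below; the `get_weather` function, the tag vocabulary, and any model name passed in are illustrative assumptions, not part of any official release.

```python
def make_weather_tool() -> dict:
    """Build an OpenAI-style tool spec for a hypothetical get_weather function."""
    return {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                },
                "required": ["city"],
            },
        },
    }

def make_chat_request(model: str, user_msg: str) -> dict:
    """Assemble a chat request body that carries the tool list."""
    return {
        "model": model,  # local model tag; the exact tag is an assumption
        "messages": [{"role": "user", "content": user_msg}],
        "tools": [make_weather_tool()],
    }
```

The runtime returns a `tool_calls` entry when the model decides to invoke the function; the caller executes it and feeds the result back as a follow‑up message.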

Licensing and variants

  • Released under Apache 2.0, widely seen as a major shift from the more restrictive prior Gemma licenses and a good fit for agents and BYOK products.
  • Both base and instruction‑tuned (“‑it”) models are provided; the “‑it” models are intended for assistant/chat use.

Local deployment & performance

  • Many reports of running Gemma 4 locally via llama.cpp, LM Studio, Ollama, MLX, Modular MAX, LiteRT‑LM, and others.
  • 26B A4B MoE praised for high token/sec at modest VRAM (e.g., ~150 tok/s on 4090, strong speeds on M‑series Macs) and good fit for agent frameworks.
  • 31B dense is noticeably slower but higher quality; can still run on 24–64GB setups with quantization.
  • On low‑power devices (e.g., Raspberry Pi 5) even E4B is very slow; on modern Macs it’s comfortably usable.
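For the llama.cpp route, serving a local model mostly comes down to a handful of standard `llama-server` flags. A small helper that composes such an invocation is sketched below; the flags (`-m`, `-c`, `-ngl`, `--port`) are real llama.cpp options, but the GGUF filename in the usage example is a placeholder, not an official artifact name.

```python
def llama_server_cmd(model_path: str, ctx: int = 8192, gpu_layers: int = 99,
                     port: int = 8080) -> list[str]:
    """Compose a llama.cpp `llama-server` command line."""
    return [
        "llama-server",
        "-m", model_path,          # path to a GGUF quant of the model
        "-c", str(ctx),            # context window in tokens
        "-ngl", str(gpu_layers),   # layers to offload to the GPU
        "--port", str(port),       # local HTTP port for the OpenAI-style API
    ]

# Example (placeholder filename):
# " ".join(llama_server_cmd("gemma4-26b-a4b-q4_k_m.gguf", ctx=32768))
```

Setting `-ngl` high offloads as many layers as fit in VRAM; dropping it lower trades speed for headroom on 24GB cards running the 31B dense model.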

Quantization & tooling

  • Unsloth released GGUF “Dynamic 2.0” quants quickly; users report near‑full quality at 4‑bit with large memory savings.
  • Newcomers are often confused by the interplay of model size, quantization level, and context length; tools like Unsloth Studio and llama.cpp’s auto‑sizing help.
  • Some interest in future QAT / NVFP4‑style variants and TurboQuant‑like KV compression.
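The size‑vs‑quant confusion mostly reduces to simple arithmetic: weight memory is parameters × bits per weight ÷ 8, plus some overhead, with the KV cache for long contexts on top. A rough back‑of‑the‑envelope calculator (the ~10% overhead factor is an assumption, not a measured constant):

```python
def gguf_weight_gb(params_billion: float, bits_per_weight: float,
                   overhead: float = 1.1) -> float:
    """Rough weight-memory estimate for a quantized model, in GB.

    params * bits/8 bytes, plus ~10% assumed overhead for embeddings,
    quant scales, and runtime buffers; the KV cache for long contexts
    is extra and grows with context length.
    """
    bytes_total = params_billion * 1e9 * bits_per_weight / 8 * overhead
    return bytes_total / 1e9
```

For example, a 26B model at ~4.5 effective bits per weight (typical of Q4_K_M‑class quants) lands around 16 GB of weights, which matches why it fits comfortably on a 24GB card while the 31B dense model is tighter.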

Quality, benchmarks & comparisons

  • Consolidated benchmark tables show Gemma 4 31B roughly competitive with other large open models, but Qwen 3.5 often leads, especially on coding and some reasoning tests.
  • Several argue that public benchmarks are heavily overfit or gamed, and trust human evaluation (e.g., Arena Elo) or private benchmarks more.
  • Others counter that private tests (e.g., ARC‑AGI 2) show Chinese models weaker and worry about training on test sets.

Use cases and early experiments

  • Reported uses: OCR + translation + embeddings for historical land records; PDF and table extraction; receipt/document tagging; spam filtering benchmarks; translation; RAG; code agents (Claude‑Code‑style workflows); local photo metadata and SVG/image generation.
  • Small E2B/E4B models impress some for on‑device multimodal tasks and SQL generation, but are weaker for code and complex reasoning than 26B/31B.
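Many of the document/extraction use cases above pair a local model with constrained JSON output. A sketch of a receipt‑tagging request body for Ollama's `/api/chat` endpoint follows; `format: "json"` and `stream` are real Ollama options, but the model tag, the prompt, and the category vocabulary are illustrative assumptions.

```python
def receipt_tag_request(model: str, receipt_text: str) -> dict:
    """Build an Ollama /api/chat body asking for structured receipt tags.

    format="json" constrains the model to emit valid JSON; the schema
    implied by the system prompt here is an assumption for illustration.
    """
    system = ("Extract merchant, date, total, and a category tag "
              "(food/travel/office/other) from the receipt. Reply as JSON.")
    return {
        "model": model,      # local tag for the model being tested
        "stream": False,     # return one complete response, not chunks
        "format": "json",    # constrain output to valid JSON
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": receipt_text},
        ],
    }
```

POSTing this body to `http://localhost:11434/api/chat` on a running Ollama instance yields a single JSON reply that can be parsed directly into a database row.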

Concerns, bugs & limitations

  • Early issues with chat templates and tool calling in llama.cpp and client apps caused broken behavior; fixes are landing, but commenters warn against judging the models on day‑one bugs.
  • Some find the “thinking” traces slow or theatrical, and note hallucinations, including cases where the model “pretends” to have run scripts or commands.
  • Setup UX (especially on Windows) is still rough; users want simple installers and better defaults.