Google releases Gemma 4 open models

Model overview & capabilities

  • Gemma 4 adds “thinking”/reasoning traces, multimodal input (images; audio on E2B/E4B), and tool calling.
  • Lineup: small E2B/E4B models (mobile‑focused, Gemma‑3n‑style architecture), a 26B mixture‑of‑experts model with ~4B active parameters (A4B), and a 31B dense model.
  • Long context (roughly 200k tokens or more on some setups) and strong performance across many public benchmarks; some users say the reasoning feels notably advanced.
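Tool calling in local runtimes such as llama.cpp's server and Ollama typically uses an OpenAI‑style JSON‑schema tool definition. A minimal sketch of building such a request is below; the `get_weather` function, the tag vocabulary, and any model name passed in are illustrative assumptions, not part of any official release.

```python
def make_weather_tool() -> dict:
    """Build an OpenAI-style tool spec for a hypothetical get_weather function."""
    return {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                },
                "required": ["city"],
            },
        },
    }

def make_chat_request(model: str, user_msg: str) -> dict:
    """Assemble a chat request body that carries the tool list."""
    return {
        "model": model,  # local model tag; the exact tag is an assumption
        "messages": [{"role": "user", "content": user_msg}],
        "tools": [make_weather_tool()],
    }
```

The runtime returns a `tool_calls` entry when the model decides to invoke the function; the caller executes it and feeds the result back as a follow‑up message.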

Licensing and variants

  • Released under Apache 2.0, widely seen as a major shift from the more restrictive prior Gemma licenses and a good fit for agents and BYOK products.
  • Both base and instruction‑tuned (“‑it”) models are provided; the “‑it” models are intended for assistant/chat use.

Local deployment & performance

  • Many reports of running Gemma 4 locally via llama.cpp, LM Studio, Ollama, MLX, Modular MAX, LiteRT‑LM, and others.
  • 26B A4B MoE praised for high token/sec at modest VRAM (e.g., ~150 tok/s on 4090, strong speeds on M‑series Macs) and good fit for agent frameworks.
  • 31B dense is noticeably slower but higher quality; can still run on 24–64GB setups with quantization.
  • On low‑power devices (e.g., Raspberry Pi 5) even E4B is very slow; on modern Macs it’s comfortably usable.
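For the llama.cpp route, serving a local model mostly comes down to a handful of standard `llama-server` flags. A small helper that composes such an invocation is sketched below; the flags (`-m`, `-c`, `-ngl`, `--port`) are real llama.cpp options, but the GGUF filename in the usage example is a placeholder, not an official artifact name.

```python
def llama_server_cmd(model_path: str, ctx: int = 8192, gpu_layers: int = 99,
                     port: int = 8080) -> list[str]:
    """Compose a llama.cpp `llama-server` command line."""
    return [
        "llama-server",
        "-m", model_path,          # path to a GGUF quant of the model
        "-c", str(ctx),            # context window in tokens
        "-ngl", str(gpu_layers),   # layers to offload to the GPU
        "--port", str(port),       # local HTTP port for the OpenAI-style API
    ]

# Example (placeholder filename):
# " ".join(llama_server_cmd("gemma4-26b-a4b-q4_k_m.gguf", ctx=32768))
```

Setting `-ngl` high offloads as many layers as fit in VRAM; dropping it lower trades speed for headroom on 24GB cards running the 31B dense model.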

Quantization & tooling

  • Unsloth released GGUF “Dynamic 2.0” quants quickly; users report near‑full quality at 4‑bit with large memory savings.
  • Newcomers are often confused by the interplay of model size, quantization level, and context length; tools like Unsloth Studio and llama.cpp’s auto‑sizing help.
  • Some interest in future QAT / NVFP4‑style variants and TurboQuant‑like KV compression.
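The size‑vs‑quant confusion mostly reduces to simple arithmetic: weight memory is parameters × bits per weight ÷ 8, plus some overhead, with the KV cache for long contexts on top. A rough back‑of‑the‑envelope calculator (the ~10% overhead factor is an assumption, not a measured constant):

```python
def gguf_weight_gb(params_billion: float, bits_per_weight: float,
                   overhead: float = 1.1) -> float:
    """Rough weight-memory estimate for a quantized model, in GB.

    params * bits/8 bytes, plus ~10% assumed overhead for embeddings,
    quant scales, and runtime buffers; the KV cache for long contexts
    is extra and grows with context length.
    """
    bytes_total = params_billion * 1e9 * bits_per_weight / 8 * overhead
    return bytes_total / 1e9
```

For example, a 26B model at ~4.5 effective bits per weight (typical of Q4_K_M‑class quants) lands around 16 GB of weights, which matches why it fits comfortably on a 24GB card while the 31B dense model is tighter.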

Quality, benchmarks & comparisons

  • Consolidated benchmark tables show Gemma 4 31B roughly competitive with other large open models, but Qwen 3.5 often leads, especially on coding and some reasoning tests.
  • Several argue that public benchmarks are heavily overfit or gamed, and trust human evaluation (e.g., Arena Elo) or private benchmarks more.
  • Others counter that private tests (e.g., ARC‑AGI 2) show Chinese models weaker and worry about training on test sets.

Use cases and early experiments

  • Reported uses: OCR + translation + embeddings for historical land records; PDF and table extraction; receipt/document tagging; spam filtering benchmarks; translation; RAG; code agents (Claude‑Code‑style workflows); local photo metadata and SVG/image generation.
  • Small E2B/E4B models impress some for on‑device multimodal tasks and SQL generation, but are weaker for code and complex reasoning than 26B/31B.
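Many of the document/extraction use cases above pair a local model with constrained JSON output. A sketch of a receipt‑tagging request body for Ollama's `/api/chat` endpoint follows; `format: "json"` and `stream` are real Ollama options, but the model tag, the prompt, and the category vocabulary are illustrative assumptions.

```python
def receipt_tag_request(model: str, receipt_text: str) -> dict:
    """Build an Ollama /api/chat body asking for structured receipt tags.

    format="json" constrains the model to emit valid JSON; the schema
    implied by the system prompt here is an assumption for illustration.
    """
    system = ("Extract merchant, date, total, and a category tag "
              "(food/travel/office/other) from the receipt. Reply as JSON.")
    return {
        "model": model,      # local tag for the model being tested
        "stream": False,     # return one complete response, not chunks
        "format": "json",    # constrain output to valid JSON
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": receipt_text},
        ],
    }
```

POSTing this body to `http://localhost:11434/api/chat` on a running Ollama instance yields a single JSON reply that can be parsed directly into a database row.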

Concerns, bugs & limitations

  • Early issues with chat templates and tool calling in llama.cpp and client apps caused broken behavior; fixes are landing, but commenters warn against judging the models on day‑one bugs.
  • Some find the “thinking” traces slow or theatrical, and note hallucinations, including cases where the model “pretends” to have run scripts or commands.
  • Setup UX (especially on Windows) is still rough; users want simple installers and better defaults.