Google releases Gemma 4 open models
Model overview & capabilities
- Gemma 4 adds “thinking”/reasoning traces, multimodal input (images; audio on E2B/E4B), and tool calling.
- Lineup: small E2B/E4B models (mobile‑focused, Gemma‑3n‑style architecture) plus a 26B A4B MoE (~4B active parameters) and a 31B dense model.
- Long context (~200k+ tokens on some setups) and strong performance on many public benchmarks; some users say the reasoning feels notably advanced.
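"Tool calling" in the list above means the model emits structured calls that a client harness parses and executes. A minimal dispatch loop might look like the sketch below; the tool names, the JSON shape, and the stubbed model output are illustrative assumptions, not Gemma 4's actual tool‑call format:

```python
import json

# Registry of tools the harness exposes to the model (names are illustrative).
TOOLS = {
    "get_time": lambda args: "12:00",
    "add": lambda args: str(args["a"] + args["b"]),
}

def dispatch(model_output: str) -> str:
    """Parse a JSON tool call emitted by the model and run the matching tool."""
    call = json.loads(model_output)
    tool = TOOLS[call["name"]]
    return tool(call.get("arguments", {}))

# Stub of what an instruction-tuned model might emit in place of plain text:
print(dispatch('{"name": "add", "arguments": {"a": 2, "b": 3}}'))  # -> 5
```

Real clients layer schema advertisement, retries, and result re‑injection into the chat on top of this core parse‑and‑dispatch step.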
Licensing and variants
- Released under Apache 2.0, widely seen as a major shift from prior Gemma licenses and a good fit for agents/BYOK use.
- Both base and instruction‑tuned (“‑it”) models are provided; the “‑it” models are intended for assistant/chat use.
Local deployment & performance
- Many reports of running Gemma 4 locally via llama.cpp, LM Studio, Ollama, MLX, Modular MAX, LiteRT‑LM, and others.
- The 26B A4B MoE is praised for high tokens/sec at modest VRAM (e.g., ~150 tok/s on a 4090, strong speeds on M‑series Macs) and as a good fit for agent frameworks.
- The 31B dense model is noticeably slower but higher quality; it can still run on 24–64GB setups with quantization.
- On low‑power devices (e.g., Raspberry Pi 5) even E4B is very slow; on modern Macs it’s comfortably usable.
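The claim that the 31B dense model fits on 24–64GB setups with quantization follows from simple weight‑size arithmetic, sketched below. The bits‑per‑weight figures are rough community conventions (e.g., ~4.5 effective bits for a Q4_K_M‑style quant), not measured numbers for any specific Gemma 4 GGUF:

```python
def weights_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate in-memory size of model weights in GiB."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 2**30

# Rough illustration for a 31B dense model at common precision levels.
for label, bits in [("fp16", 16), ("Q8_0", 8.5), ("~Q4_K_M", 4.5)]:
    print(f"{label:>8}: ~{weights_gib(31, bits):.1f} GiB of weights")
```

At ~4.5 bits/weight the weights alone land around 16 GiB, which leaves headroom on a 24GB card only once KV cache and runtime overhead are accounted for.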
Quantization & tooling
- Unsloth released GGUF “Dynamic 2.0” quants quickly; users report near‑full quality at 4‑bit with large memory savings.
- Newcomers report confusion about the interplay of model size, quant level, and context length; tools like Unsloth Studio and llama.cpp auto‑sizing help.
- Some interest in future QAT / NVFP4‑style variants and TurboQuant‑like KV compression.
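Part of the model‑size vs. quant vs. context confusion is that KV cache memory grows linearly with context length and is budgeted separately from the weights. A back‑of‑the‑envelope estimate is sketched below; the layer/head numbers are hypothetical placeholders, not Gemma 4's actual architecture:

```python
def kv_cache_gib(context_len: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_elt: int = 2) -> float:
    """KV cache size: 2 tensors (K and V) * layers * kv_heads * head_dim
    * tokens * bytes per element, in GiB."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elt / 2**30

# Hypothetical config (NOT Gemma 4's published architecture):
# 48 layers, 8 KV heads of dim 128, fp16 cache.
for ctx in (8_192, 65_536, 200_000):
    print(f"{ctx:>7} tokens -> ~{kv_cache_gib(ctx, 48, 8, 128):.1f} GiB KV cache")
```

Under these placeholder numbers the cache goes from ~1.5 GiB at 8k tokens to tens of GiB at ~200k, which is why long‑context runs can exhaust VRAM even when the quantized weights fit comfortably, and why KV‑compression ideas like the TurboQuant‑style work mentioned above draw interest.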
Quality, benchmarks & comparisons
- Consolidated benchmark tables show Gemma 4 31B roughly competitive with other large open models, but Qwen 3.5 often leads, especially on coding and some reasoning tests.
- Several argue public benchmarks are heavily overfit/gamed and say they trust human evaluation (e.g., Arena Elo) or private benchmarks more.
- Others counter that private tests (e.g., ARC‑AGI 2) show Chinese models weaker and worry about training on test sets.
Use cases and early experiments
- Reported uses: OCR + translation + embeddings for historical land records; PDF and table extraction; receipt/document tagging; spam filtering benchmarks; translation; RAG; code agents (Claude‑Code‑style workflows); local photo metadata and SVG/image generation.
- Small E2B/E4B models impress some for on‑device multimodal tasks and SQL generation, but are weaker for code and complex reasoning than 26B/31B.
Concerns, bugs & limitations
- Early issues with chat templates and tool calling in llama.cpp and various clients caused broken behavior; fixes are landing, and commenters warn against judging the models on day‑one bugs.
- Some find “thinking” traces slow or theatrical and note hallucinations even when the model “pretends” to run scripts/commands.
- Setup UX (especially on Windows) is still rough; users want simple installers and better defaults.