I ran Gemma 4 as a local model in Codex CLI
Overall impressions of Gemma 4 for local use
- Many commenters report Gemma 4 (especially 26B and 31B) as the first local model that feels “good enough” for serious coding and doc navigation.
- Several say it’s close to GPT-OSS tier for one-shot coding, but weaker in iterative / agentic workflows and non-coding reasoning.
- Some users find it flails on moderately complex codebases or specific tasks where other models succeed.
Hardware, engines, and performance
- Successful setups span:
  - Nvidia 4090 / dual 4090, 3090, GB10/Spark, with 24–128+ GB system RAM.
  - Mac M1/M3/M4/M5 Pro/Max/Ultra with 16–64+ GB unified memory.
  - AMD RX 7900 XTX, and even CPU-only 64 GB machines.
- One report puts the Mac M5 Pro at roughly 8x the tokens/s of an M4 Pro on the same MoE Q4 model, though it's unclear how much of the gain comes from compute versus memory bandwidth.
- Some consider Macs poor ROI for inference compared to Nvidia GPUs; others report great real-world speed via MLX/LM Studio when the model is supported there.
- Engines tried include llama.cpp, LM Studio, Ollama, vLLM, and OpenWebUI. Opinions differ: some rank Ollama worst; others find vLLM finicky and prefer llama.cpp.
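Most of the engines above can expose an OpenAI-compatible HTTP endpoint, which is how a harness like Codex CLI typically talks to them. A minimal client sketch, assuming a server is already running on localhost:8080 and serving a model registered as `gemma-4-26b` (both the port and the model id are placeholders):

```python
# Minimal sketch: query a local OpenAI-compatible endpoint.
# Assumes llama-server, LM Studio, Ollama, or vLLM is already running
# and serving a model under the (hypothetical) id "gemma-4-26b".
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # local server, not api.openai.com
    api_key="not-needed-locally",         # most local servers ignore the key
)

resp = client.chat.completions.create(
    model="gemma-4-26b",  # use whatever model id your server reports
    messages=[{"role": "user", "content": "Summarize what a GGUF quant is."}],
    temperature=0.2,
)
print(resp.choices[0].message.content)
```

Because the wire format is shared, switching engines usually means changing only `base_url` and the model id.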
Quantization, context, and quality
- Strong consensus: for coding, higher-precision quants (Q6_K, Q8_0) are much more reliable; heavily compressed quants (Q4 and below) noticeably hurt reliability.
- Advice: pick the highest-precision quant your memory can hold, even if it runs slower.
- Larger context windows (64k–512k) are possible but eat into speed or memory; some offload the MoE expert weights to CPU to free memory for context, trading speed for capacity (a rough sizing sketch follows this list).
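The sizing advice above is mostly arithmetic: weight bytes scale with bits per weight, and the KV cache grows linearly with context length. A back-of-envelope sketch, using approximate GGUF bits-per-weight averages and a made-up model shape (the layer/head numbers below are placeholders, not Gemma 4's real config):

```python
# Back-of-envelope sketch: will a given quant + context fit in memory?
# Bits-per-weight figures are approximate GGUF averages; the model
# shape is a hypothetical stand-in for a 26B-class model.

BPW = {"Q4_K_M": 4.8, "Q6_K": 6.6, "Q8_0": 8.5}  # approx. bits per weight

def weights_gb(n_params: float, quant: str) -> float:
    return n_params * BPW[quant] / 8 / 1e9

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx_len: int, bytes_per_elem: int = 2) -> float:
    # K and V each store n_layers * n_kv_heads * head_dim values per token.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

# Placeholder shape: 26e9 params, 48 layers, 8 KV heads, head_dim 128.
params, layers, kv_heads, hdim = 26e9, 48, 8, 128

for quant in BPW:
    total = weights_gb(params, quant) + kv_cache_gb(layers, kv_heads, hdim, 65536)
    print(f"{quant}: ~{total:.1f} GB incl. 64k-token fp16 KV cache")
```

With these placeholder numbers, Q8_0 weights alone come to ~27.6 GB and a 64k fp16 KV cache adds ~12.9 GB, which is why context size competes directly with quant quality for the same memory budget.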
Coding, tool calling, and agents
- Tool/function calling remains a weak spot: models get stuck in loops, issue malformed tool calls, or fail to make follow-up calls (a minimal harness loop is sketched after this list).
- Newer tokenizer/chat templates reportedly improve Gemma 4's tool use, but results are mixed.
- Some pair Gemma 4 with lighter agents (e.g., Pi) for lower overhead, or use draft models with speculative decoding for speed (a toy draft-and-verify sketch also follows).
- Emerging pattern: local Gemma handles bulk experiments and small refactors, while cloud frontier models handle architecture and hard bugs.
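For concreteness, here is roughly the loop a tool-calling harness runs, with a hard iteration cap as a cheap guard against the stuck-in-a-loop failure mode described above. The endpoint, model id, and the single `read_file` tool are all hypothetical:

```python
# Sketch of a tool-call loop with an iteration cap. Endpoint, model id,
# and the demo tool are placeholders, not Codex CLI's actual internals.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed-locally")

TOOLS = [{
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Return the contents of a file in the repo.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

def read_file(path: str) -> str:
    with open(path) as f:
        return f.read()

messages = [{"role": "user", "content": "What does setup.py import?"}]
for _ in range(8):  # hard cap: weak tool-callers can loop forever
    resp = client.chat.completions.create(
        model="gemma-4-26b", messages=messages, tools=TOOLS)
    msg = resp.choices[0].message
    if not msg.tool_calls:       # plain-text answer: we're done
        print(msg.content)
        break
    messages.append(msg)         # keep the assistant turn in the history
    for call in msg.tool_calls:  # execute every requested call
        args = json.loads(call.function.arguments)
        messages.append({        # feed results back so follow-up calls work
            "role": "tool",
            "tool_call_id": call.id,
            "content": read_file(**args),
        })
```

The two guards here, capping iterations and always appending tool results back into the history, are exactly the steps weak tool-callers trip over when they loop or drop follow-ups.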
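And a toy illustration of the draft-and-verify idea behind speculative decoding: a cheap draft model proposes a few tokens, the large model verifies them (real engines score all positions in a single batched forward pass), and the longest agreeing prefix is kept, so the draft changes only speed, never output. The "models" below are stand-in functions, not real LLMs:

```python
# Toy speculative decoding (greedy variant). Real engines such as
# llama.cpp implement this internally; this only shows the control flow.
from typing import Callable, List

Model = Callable[[List[str]], str]  # context -> next token (greedy)

def speculative_step(target: Model, draft: Model,
                     ctx: List[str], k: int = 4) -> List[str]:
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed: List[str] = []
    for _ in range(k):
        proposed.append(draft(ctx + proposed))
    # 2. The target model checks each position and keeps the longest
    #    agreeing prefix; the first mismatch is replaced by its own token.
    accepted: List[str] = []
    for tok in proposed:
        expected = target(ctx + accepted)
        if tok != expected:
            accepted.append(expected)  # fix the first mismatch, stop
            break
        accepted.append(tok)
    else:
        accepted.append(target(ctx + accepted))  # all k accepted: bonus token
    return accepted

# Stand-in "models": both continue a fixed sentence; the draft disagrees
# with the target at every 3rd position, so some proposals get rejected.
SENTENCE = "the quick brown fox jumps over the lazy dog".split()
def target(ctx: List[str]) -> str:
    return SENTENCE[len(ctx) % len(SENTENCE)]
def draft(ctx: List[str]) -> str:
    return "???" if len(ctx) % 3 == 2 else SENTENCE[len(ctx) % len(SENTENCE)]

ctx: List[str] = []
while len(ctx) < len(SENTENCE):
    ctx += speculative_step(target, draft, ctx)
print(" ".join(ctx[:len(SENTENCE)]))  # -> the quick brown fox jumps over the lazy dog
```

The output always matches what the target model would have produced alone; the speedup depends entirely on how often the draft agrees with it.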
Safety, censorship, and uncensored variants
- One thread criticizes Gemma 4's "censorship," especially on medical questions, arguing that offline users need best-effort answers rather than refusals.
- Others defend refusals as safer than potentially wrong high-stakes advice.
- Several note that “uncensored” or “abliterated” Gemma variants exist and can be used instead.
Comparisons and alternatives
- Qwen 3.5, GLM Flash, some distilled Qwen-based models, and other open models are cited as stronger in some coding or tool-calling tasks.
- Benchmarks shared in the thread show Gemma 4 26B-A4B exceptionally strong in one-shot coding but weaker in agentic scenarios and large-context tasks.