I ran Gemma 4 as a local model in Codex CLI
Overall impressions of Gemma 4 for local use
- Many commenters report Gemma 4 (especially 26B and 31B) as the first local model that feels “good enough” for serious coding and doc navigation.
- Several say it’s close to GPT-OSS tier for one-shot coding, but weaker in iterative / agentic workflows and non-coding reasoning.
- Some users find it flails on moderately complex codebases or specific tasks where other models succeed.
Hardware, engines, and performance
- Successful setups span:
  - Nvidia 4090 / dual 4090, 3090, GB10/Spark, with 24–128+ GB system RAM.
  - Mac M1/M3/M4/M5 Pro/Max/Ultra with 16–64+ GB unified memory.
  - AMD RX 7900 XTX, and even CPU-only 64 GB machines.
- One report puts the Mac M5 Pro at roughly 8x the tokens/s of an M4 Pro on the same MoE Q4 model, though it's unclear how much of the gain comes from compute versus memory bandwidth.
- Some consider Macs poor ROI for inference compared to Nvidia GPUs; others report great real-world speed via MLX/LM Studio when the model is supported there.
- Engines tried include llama.cpp, LM Studio, Ollama, vLLM, and OpenWebUI. Opinions differ: some rank Ollama worst; others find vLLM finicky and prefer llama.cpp.
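Most of the engines above can expose an OpenAI-compatible HTTP endpoint, which is how a harness like Codex CLI typically talks to them. A minimal client sketch, assuming a server is already running on localhost:8080 and serving a model registered as `gemma-4-26b` (both the port and the model id are placeholders):

```python
# Minimal sketch: query a local OpenAI-compatible endpoint.
# Assumes llama-server, LM Studio, Ollama, or vLLM is already running
# and serving a model under the (hypothetical) id "gemma-4-26b".
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # local server, not api.openai.com
    api_key="not-needed-locally",         # most local servers ignore the key
)

resp = client.chat.completions.create(
    model="gemma-4-26b",  # use whatever model id your server reports
    messages=[{"role": "user", "content": "Summarize what a GGUF quant is."}],
    temperature=0.2,
)
print(resp.choices[0].message.content)
```

Because the wire format is shared, switching engines usually means changing only `base_url` and the model id.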
Quantization, context, and quality
- Strong consensus: for coding, higher-precision quants (Q6_K, Q8_0) are much more reliable; heavily compressed quants (Q4 and below) noticeably hurt reliability.
- Advice: pick the highest-precision quant your memory can hold, even if it runs slower.
- Larger context windows (64k–512k) are possible but eat into speed or memory; some offload the MoE expert weights to CPU to free memory for context, trading speed for capacity (a rough sizing sketch follows this list).
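The sizing advice above is mostly arithmetic: weight bytes scale with bits per weight, and the KV cache grows linearly with context length. A back-of-envelope sketch, using approximate GGUF bits-per-weight averages and a made-up model shape (the layer/head numbers below are placeholders, not Gemma 4's real config):

```python
# Back-of-envelope sketch: will a given quant + context fit in memory?
# Bits-per-weight figures are approximate GGUF averages; the model
# shape is a hypothetical stand-in for a 26B-class model.

BPW = {"Q4_K_M": 4.8, "Q6_K": 6.6, "Q8_0": 8.5}  # approx. bits per weight

def weights_gb(n_params: float, quant: str) -> float:
    return n_params * BPW[quant] / 8 / 1e9

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx_len: int, bytes_per_elem: int = 2) -> float:
    # K and V each store n_layers * n_kv_heads * head_dim values per token.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

# Placeholder shape: 26e9 params, 48 layers, 8 KV heads, head_dim 128.
params, layers, kv_heads, hdim = 26e9, 48, 8, 128

for quant in BPW:
    total = weights_gb(params, quant) + kv_cache_gb(layers, kv_heads, hdim, 65536)
    print(f"{quant}: ~{total:.1f} GB incl. 64k-token fp16 KV cache")
```

With these placeholder numbers, Q8_0 weights alone come to ~27.6 GB and a 64k fp16 KV cache adds ~12.9 GB, which is why context size competes directly with quant quality for the same memory budget.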
Coding, tool calling, and agents
- Tool/function calling remains a weak spot: models get stuck in loops, issue malformed tool calls, or fail to make follow-up calls (a minimal harness loop is sketched after this list).
- Newer tokenizer/chat templates reportedly improve Gemma 4's tool use, but results are mixed.
- Some pair Gemma 4 with lighter agents (e.g., Pi) for lower overhead, or use draft models with speculative decoding for speed (a toy draft-and-verify sketch also follows).
- Emerging pattern: local Gemma handles bulk experiments and small refactors, while cloud frontier models handle architecture and hard bugs.
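For concreteness, here is roughly the loop a tool-calling harness runs, with a hard iteration cap as a cheap guard against the stuck-in-a-loop failure mode described above. The endpoint, model id, and the single `read_file` tool are all hypothetical:

```python
# Sketch of a tool-call loop with an iteration cap. Endpoint, model id,
# and the demo tool are placeholders, not Codex CLI's actual internals.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed-locally")

TOOLS = [{
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Return the contents of a file in the repo.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

def read_file(path: str) -> str:
    with open(path) as f:
        return f.read()

messages = [{"role": "user", "content": "What does setup.py import?"}]
for _ in range(8):  # hard cap: weak tool-callers can loop forever
    resp = client.chat.completions.create(
        model="gemma-4-26b", messages=messages, tools=TOOLS)
    msg = resp.choices[0].message
    if not msg.tool_calls:       # plain-text answer: we're done
        print(msg.content)
        break
    messages.append(msg)         # keep the assistant turn in the history
    for call in msg.tool_calls:  # execute every requested call
        args = json.loads(call.function.arguments)
        messages.append({        # feed results back so follow-up calls work
            "role": "tool",
            "tool_call_id": call.id,
            "content": read_file(**args),
        })
```

The two guards here, capping iterations and always appending tool results back into the history, are exactly the steps weak tool-callers trip over when they loop or drop follow-ups.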
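And a toy illustration of the draft-and-verify idea behind speculative decoding: a cheap draft model proposes a few tokens, the large model verifies them (real engines score all positions in a single batched forward pass), and the longest agreeing prefix is kept, so the draft changes only speed, never output. The "models" below are stand-in functions, not real LLMs:

```python
# Toy speculative decoding (greedy variant). Real engines such as
# llama.cpp implement this internally; this only shows the control flow.
from typing import Callable, List

Model = Callable[[List[str]], str]  # context -> next token (greedy)

def speculative_step(target: Model, draft: Model,
                     ctx: List[str], k: int = 4) -> List[str]:
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed: List[str] = []
    for _ in range(k):
        proposed.append(draft(ctx + proposed))
    # 2. The target model checks each position and keeps the longest
    #    agreeing prefix; the first mismatch is replaced by its own token.
    accepted: List[str] = []
    for tok in proposed:
        expected = target(ctx + accepted)
        if tok != expected:
            accepted.append(expected)  # fix the first mismatch, stop
            break
        accepted.append(tok)
    else:
        accepted.append(target(ctx + accepted))  # all k accepted: bonus token
    return accepted

# Stand-in "models": both continue a fixed sentence; the draft disagrees
# with the target at every 3rd position, so some proposals get rejected.
SENTENCE = "the quick brown fox jumps over the lazy dog".split()
def target(ctx: List[str]) -> str:
    return SENTENCE[len(ctx) % len(SENTENCE)]
def draft(ctx: List[str]) -> str:
    return "???" if len(ctx) % 3 == 2 else SENTENCE[len(ctx) % len(SENTENCE)]

ctx: List[str] = []
while len(ctx) < len(SENTENCE):
    ctx += speculative_step(target, draft, ctx)
print(" ".join(ctx[:len(SENTENCE)]))  # -> the quick brown fox jumps over the lazy dog
```

The output always matches what the target model would have produced alone; the speedup depends entirely on how often the draft agrees with it.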
Safety, censorship, and uncensored variants
- One thread criticizes Gemma 4's "censorship," especially on medical questions, arguing that offline users need best-effort answers rather than refusals.
- Others defend refusals as safer than potentially wrong high-stakes advice.
- Several note that “uncensored” or “abliterated” Gemma variants exist and can be used instead.
Comparisons and alternatives
- Qwen 3.5, GLM Flash, some distilled Qwen-based models, and other open models are cited as stronger in some coding or tool-calling tasks.
- Benchmarks shared in the thread show Gemma 4 26B-A4B exceptionally strong in one-shot coding but weaker in agentic scenarios and large-context tasks.