Running Gemma 4 locally with LM Studio's new headless CLI and Claude Code
Running Gemma 4 Locally (Ollama, LM Studio, llama.cpp, oMLX)
- Users are successfully running Gemma 4 (especially the 26B/31B and MoE A4B variants) on macOS with Ollama, LM Studio headless, the llama.cpp server, and oMLX.
- Some report Gemma hanging or looping under Ollama with ROCm; switching the backend to Vulkan and/or changing the quantization (e.g., to q8) and context size fixes it.
- Context window size is critical: default limits can break tool calling. People bump the limit via environment variables or app settings, sometimes to 64k–128k+ tokens (see the Ollama sketch after this list).
- llama.cpp’s `llama-server` (with its Anthropic-compatible `v1/messages` endpoint) integrates cleanly with Claude Code, as sketched below; MLX/oMLX support is improving but still has performance rough edges.
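As a concrete illustration of the context-size fix, here is a minimal sketch that raises the window per request through Ollama's REST API. The model tag is hypothetical, and the `num_ctx` value is one of the 64k+ figures people report needing for tool calling:

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local endpoint
MODEL = "gemma4:26b"  # hypothetical tag; substitute whatever `ollama list` shows

# Per-request override of the context window via Ollama's `options.num_ctx`.
# Without it, a small default context can truncate long tool-calling
# transcripts and make the agent appear to hang or loop.
resp = requests.post(
    OLLAMA_URL,
    json={
        "model": MODEL,
        "messages": [{"role": "user", "content": "Summarize this repo's layout."}],
        "stream": False,
        "options": {"num_ctx": 65536},  # bump the context window to 64k tokens
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])
```

The same limit can be made persistent with a `PARAMETER num_ctx 65536` line in a Modelfile instead of passing it on every request.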
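The `v1/messages` integration is also easy to smoke-test directly. A minimal sketch, assuming a running `llama-server` on its default port and an endpoint that mirrors the Anthropic Messages schema as the thread describes (the model name is a placeholder; the server answers with whatever GGUF it loaded):

```python
import requests

BASE_URL = "http://localhost:8080"  # wherever llama-server is listening

# Anthropic Messages-style request. A local server typically ignores the
# API key, but sending the headers keeps strict clients happy.
resp = requests.post(
    f"{BASE_URL}/v1/messages",
    headers={"x-api-key": "local", "anthropic-version": "2023-06-01"},
    json={
        "model": "gemma-4",  # placeholder; local servers usually don't enforce this
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": "Write a haiku about local inference."}],
    },
    timeout=300,
)
resp.raise_for_status()

# Anthropic-style responses carry a list of content blocks.
for block in resp.json()["content"]:
    if block["type"] == "text":
        print(block["text"])
```

If this round-trips, pointing Claude Code at the same server is reportedly just a matter of setting `ANTHROPIC_BASE_URL` to the server address before launching it.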
Performance, Hardware, and MoE Tradeoffs
- Benchmarks on Apple Silicon show llama.cpp (Metal) often generating tokens faster than oMLX for the same dynamic 4-bit Gemma 4 model.
- On unified-memory and mixed GPU/CPU setups, users see ~40–60 tok/s decode for 30–35B models in good cases, but some report 10+ minutes per answer for agentic coding on weaker GPUs and laptops (a quick way to measure your own decode rate is sketched after this list).
- Debate over MoE memory: one side says MoE doesn’t reduce VRAM requirements because all experts must stay loaded; others note you can offload experts to RAM or disk, gaining capacity at the cost of large I/O and latency penalties (see the arithmetic sketch below).
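For the throughput numbers above, a rough check against any OpenAI-compatible local endpoint (llama-server, LM Studio, etc.) is enough to reproduce the comparison on your own hardware. This sketch assumes the server fills in the standard `usage` field in its response:

```python
import time
import requests

# Any OpenAI-compatible local server works here.
URL = "http://localhost:8080/v1/chat/completions"

start = time.perf_counter()
resp = requests.post(
    URL,
    json={
        "model": "local",  # most local servers ignore or loosely match this field
        "messages": [{"role": "user", "content": "Explain mixture-of-experts in 300 words."}],
        "max_tokens": 512,
    },
    timeout=600,
)
resp.raise_for_status()
elapsed = time.perf_counter() - start

completion_tokens = resp.json()["usage"]["completion_tokens"]
# Crude decode rate: the wall clock includes prompt processing, so this
# slightly understates pure token-generation speed on long prompts.
print(f"{completion_tokens} tokens in {elapsed:.1f}s -> {completion_tokens / elapsed:.1f} tok/s")
```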
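Both sides of the MoE debate are consistent with some back-of-envelope arithmetic. The numbers below are illustrative rather than measurements of any specific model, assuming a ~30B-parameter MoE with ~4B active parameters at roughly 4-bit quantization:

```python
GB = 1e9

total_params = 30e9    # all experts combined (~30B-parameter MoE)
active_params = 4e9    # "A4B": roughly 4B parameters active per token
bytes_per_param = 0.5  # ~4-bit quantization

print(f"VRAM if every expert stays resident:  ~{total_params * bytes_per_param / GB:.0f} GB")
print(f"VRAM if only active experts resident: ~{active_params * bytes_per_param / GB:.0f} GB")
# The catch: which experts are active changes per token, so keeping only the
# active set in VRAM means continually paging experts in from RAM or disk,
# which is exactly the I/O and latency penalty the thread describes.
```

So MoE does not shrink the model's total footprint, but it does shrink the hot working set, which is why offloading trades speed for capacity rather than being free.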
Claude Code vs Other Coding Agents
- Claude Code is popular as a frontend because it’s easy to use, has a good UX, and can be pointed at any Anthropic-compatible or OpenAI-compatible local endpoint without a subscription (see the SDK sketch after this list).
- Criticisms: it is token-inefficient, sometimes “loses its place,” halts mid-task, shows visual glitches, and behaves worse with some local models.
- Alternatives mentioned as better or more flexible: OpenCode, Pi, Codex, Zed, Cursor, “caveman” mode, cloclo, and others; opinions differ on which harness feels best.
- Some prefer simpler or self-controlled sandboxing over built-in sandboxes like Codex’s.
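The "point it at any endpoint" claim is straightforward to exercise outside Claude Code as well: the official `anthropic` Python SDK accepts a `base_url` override, so an Anthropic-compatible local server can stand in for the hosted API. A minimal sketch, reusing the hypothetical `llama-server` instance from the earlier example:

```python
from anthropic import Anthropic

# base_url redirects the SDK to a local Anthropic-compatible server.
# The API key is a placeholder; local servers generally don't check it.
client = Anthropic(base_url="http://localhost:8080", api_key="local")

message = client.messages.create(
    model="gemma-4",  # placeholder; use whatever your server exposes
    max_tokens=512,
    messages=[{"role": "user", "content": "Refactor this function to be pure."}],
)
print(message.content[0].text)
```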
Model Quality, Use Cases, and Ecosystem Trends
- Mixed impressions of Gemma 4 for agentic coding; some prefer Qwen3-coder or Qwen3.5 MoE variants, which benchmark higher for coding tasks.
- A specific interactive chat template for Gemma 4 in llama.cpp is reported to dramatically reduce looping and improve task completion.
- One view: harnesses and models are now decoupled, and harnesses are becoming commodities. Another view: models are commoditizing while harnesses and RL tuning drive the real gains. Many conclude that both layers are commoditizing.
- Enthusiasm: local models feel increasingly “pleasant,” promising private, cheap daily use with cloud models reserved for harder tasks.
- Skepticism: even with expensive GPUs or high-end laptops, local models still lag cloud frontier models on speed and quality, especially for heavy coding agents.