Running Gemma 4 locally with LM Studio's new headless CLI and Claude Code
Running Gemma 4 Locally (Ollama, LM Studio, llama.cpp, oMLX)
- Users are successfully running Gemma 4 (especially the 26B/31B and MoE A4B variants) on macOS with Ollama, LM Studio headless, the llama.cpp server, and oMLX.
- Some report Gemma hanging or looping under Ollama with ROCm; switching the backend to Vulkan and/or changing the quantization (e.g., to q8) and context size fixes it.
- Context window size is critical: default limits can break tool calling. People bump the limit via environment variables or app settings, sometimes to 64k–128k+ tokens (see the Ollama sketch after this list).
- llama.cpp’s `llama-server` (with its Anthropic-compatible `v1/messages` endpoint) integrates cleanly with Claude Code, as sketched below; MLX/oMLX support is improving but still has performance rough edges.
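As a concrete illustration of the context-size fix, here is a minimal sketch that raises the window per request through Ollama's REST API. The model tag is hypothetical, and the `num_ctx` value is one of the 64k+ figures people report needing for tool calling:

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local endpoint
MODEL = "gemma4:26b"  # hypothetical tag; substitute whatever `ollama list` shows

# Per-request override of the context window via Ollama's `options.num_ctx`.
# Without it, a small default context can truncate long tool-calling
# transcripts and make the agent appear to hang or loop.
resp = requests.post(
    OLLAMA_URL,
    json={
        "model": MODEL,
        "messages": [{"role": "user", "content": "Summarize this repo's layout."}],
        "stream": False,
        "options": {"num_ctx": 65536},  # bump the context window to 64k tokens
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])
```

The same limit can be made persistent with a `PARAMETER num_ctx 65536` line in a Modelfile instead of passing it on every request.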
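The `v1/messages` integration is also easy to smoke-test directly. A minimal sketch, assuming a running `llama-server` on its default port and an endpoint that mirrors the Anthropic Messages schema as the thread describes (the model name is a placeholder; the server answers with whatever GGUF it loaded):

```python
import requests

BASE_URL = "http://localhost:8080"  # wherever llama-server is listening

# Anthropic Messages-style request. A local server typically ignores the
# API key, but sending the headers keeps strict clients happy.
resp = requests.post(
    f"{BASE_URL}/v1/messages",
    headers={"x-api-key": "local", "anthropic-version": "2023-06-01"},
    json={
        "model": "gemma-4",  # placeholder; local servers usually don't enforce this
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": "Write a haiku about local inference."}],
    },
    timeout=300,
)
resp.raise_for_status()

# Anthropic-style responses carry a list of content blocks.
for block in resp.json()["content"]:
    if block["type"] == "text":
        print(block["text"])
```

If this round-trips, pointing Claude Code at the same server is reportedly just a matter of setting `ANTHROPIC_BASE_URL` to the server address before launching it.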
Performance, Hardware, and MoE Tradeoffs
- Benchmarks on Apple Silicon show llama.cpp (Metal) often generating tokens faster than oMLX for the same dynamic 4-bit Gemma 4 model.
- On unified-memory and mixed GPU/CPU setups, users see ~40–60 tok/s decode for 30–35B models in good cases, but some report 10+ minutes per answer for agentic coding on weaker GPUs and laptops (a quick way to measure your own decode rate is sketched after this list).
- Debate over MoE memory: one side says MoE doesn’t reduce VRAM requirements because all experts must stay loaded; others note you can offload experts to RAM or disk, gaining capacity at the cost of large I/O and latency penalties (see the arithmetic sketch below).
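For the throughput numbers above, a rough check against any OpenAI-compatible local endpoint (llama-server, LM Studio, etc.) is enough to reproduce the comparison on your own hardware. This sketch assumes the server fills in the standard `usage` field in its response:

```python
import time
import requests

# Any OpenAI-compatible local server works here.
URL = "http://localhost:8080/v1/chat/completions"

start = time.perf_counter()
resp = requests.post(
    URL,
    json={
        "model": "local",  # most local servers ignore or loosely match this field
        "messages": [{"role": "user", "content": "Explain mixture-of-experts in 300 words."}],
        "max_tokens": 512,
    },
    timeout=600,
)
resp.raise_for_status()
elapsed = time.perf_counter() - start

completion_tokens = resp.json()["usage"]["completion_tokens"]
# Crude decode rate: the wall clock includes prompt processing, so this
# slightly understates pure token-generation speed on long prompts.
print(f"{completion_tokens} tokens in {elapsed:.1f}s -> {completion_tokens / elapsed:.1f} tok/s")
```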
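Both sides of the MoE debate are consistent with some back-of-envelope arithmetic. The numbers below are illustrative rather than measurements of any specific model, assuming a ~30B-parameter MoE with ~4B active parameters at roughly 4-bit quantization:

```python
GB = 1e9

total_params = 30e9    # all experts combined (~30B-parameter MoE)
active_params = 4e9    # "A4B": roughly 4B parameters active per token
bytes_per_param = 0.5  # ~4-bit quantization

print(f"VRAM if every expert stays resident:  ~{total_params * bytes_per_param / GB:.0f} GB")
print(f"VRAM if only active experts resident: ~{active_params * bytes_per_param / GB:.0f} GB")
# The catch: which experts are active changes per token, so keeping only the
# active set in VRAM means continually paging experts in from RAM or disk,
# which is exactly the I/O and latency penalty the thread describes.
```

So MoE does not shrink the model's total footprint, but it does shrink the hot working set, which is why offloading trades speed for capacity rather than being free.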
Claude Code vs Other Coding Agents
- Claude Code is popular as a frontend because it’s easy to use, has a good UX, and can be pointed at any Anthropic-compatible or OpenAI-compatible local endpoint without a subscription (see the SDK sketch after this list).
- Criticisms: it is token-inefficient, sometimes “loses its place,” halts mid-task, shows visual glitches, and behaves worse with some local models.
- Alternatives mentioned as better or more flexible: OpenCode, Pi, Codex, Zed, Cursor, “caveman” mode, cloclo, and others; opinions differ on which harness feels best.
- Some prefer simpler or self-controlled sandboxing over built-in sandboxes like Codex’s.
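The "point it at any endpoint" claim is straightforward to exercise outside Claude Code as well: the official `anthropic` Python SDK accepts a `base_url` override, so an Anthropic-compatible local server can stand in for the hosted API. A minimal sketch, reusing the hypothetical `llama-server` instance from the earlier example:

```python
from anthropic import Anthropic

# base_url redirects the SDK to a local Anthropic-compatible server.
# The API key is a placeholder; local servers generally don't check it.
client = Anthropic(base_url="http://localhost:8080", api_key="local")

message = client.messages.create(
    model="gemma-4",  # placeholder; use whatever your server exposes
    max_tokens=512,
    messages=[{"role": "user", "content": "Refactor this function to be pure."}],
)
print(message.content[0].text)
```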
Model Quality, Use Cases, and Ecosystem Trends
- Mixed impressions of Gemma 4 for agentic coding; some prefer Qwen3-coder or Qwen3.5 MoE variants, which benchmark higher for coding tasks.
- A specific interactive chat template for Gemma 4 in llama.cpp is reported to dramatically reduce looping and improve task completion.
- One view: harnesses and models are now decoupled, and harnesses are becoming commodities. Another view: models are commoditizing while harnesses and RL tuning drive the real gains. Many conclude that both layers are commoditizing.
- Enthusiasm: local models feel increasingly “pleasant,” promising private, cheap daily use with cloud models reserved for harder tasks.
- Skepticism: even with expensive GPUs or high-end laptops, local models still lag cloud frontier models on speed and quality, especially for heavy coding agents.