April 2026 TL;DR: Setup for Ollama and Gemma 4 26B on a Mac mini
Ollama vs Alternatives (LM Studio, llama.cpp, others)
- Many argue there’s “no reason” to use Ollama over llama.cpp or LM Studio, citing slower speed, worse defaults (e.g., short context), and lagging model support.
- Others find Ollama convenient as a backend for various frontends, and appreciate simple commands like `ollama pull` and its OpenAI-compatible API (see the sketch after this list).
- LM Studio is often recommended for beginners as a more polished GUI and model manager, but it is closed source and desktop-focused.
- llama.cpp is seen as the “real” core engine: faster, more up to date, and more configurable, though initially perceived as harder to set up.
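A minimal sketch of the “simple CLI plus OpenAI-compatible API” workflow, assuming Ollama is running on its default port (11434) and a Gemma 4 build has already been pulled; the `gemma4:26b` tag is a hypothetical placeholder, since actual tags depend on what the Ollama library ships:

```python
# Talking to Ollama's OpenAI-compatible endpoint after something like:
#   ollama pull gemma4:26b        (hypothetical tag; check `ollama list`)
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible API
    api_key="ollama",                      # required by the client, ignored by Ollama
)

resp = client.chat.completions.create(
    model="gemma4:26b",  # hypothetical tag; substitute whatever you pulled
    messages=[{"role": "user", "content": "Summarize what a GGUF file is."}],
)
print(resp.choices[0].message.content)
```

LM Studio and llama.cpp’s llama-server expose the same style of endpoint, so switching backends for comparison is usually just a change of base_url and model name.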
Licensing, Openness, and Ethics
- Strong criticism that Ollama started as a derivative of llama.cpp without proper attribution, and of its “quasi-open-source” behavior (e.g., mangling GGUF filenames, limited upstream contributions).
- Counterpoint: the code is MIT-licensed and publicly available, so in a narrow sense it is open source; issues center on attribution and ecosystem behavior, not access.
Performance Comparisons
- Benchmarks conflict:
- On an M4 Mac mini, one report finds Ollama ~25% faster than LM Studio for the same Gemma 4 model.
- On an older AMD GPU, others find llama.cpp fastest, LM Studio 3× faster than Ollama, and Ollama the slowest across several models.
- Perceived load times also differ; some feel Ollama loads models “nearly instantly” compared with llama.cpp. A rough way to check throughput on your own setup is sketched below.
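Since the reports above conflict, here is a rough throughput probe for any OpenAI-compatible server (Ollama, LM Studio, or llama.cpp’s llama-server). It is a sketch, not a rigorous benchmark: it lumps prompt processing and generation together and ignores model load time, and the base_url and model tag are assumptions:

```python
# Rough tokens/sec probe against an OpenAI-compatible chat endpoint.
import time
from openai import OpenAI

def tokens_per_second(base_url: str, model: str, prompt: str, max_tokens: int = 256) -> float:
    client = OpenAI(base_url=base_url, api_key="not-needed")
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    elapsed = time.perf_counter() - start
    # Assumes the server reports token usage; Ollama, LM Studio, and llama-server generally do.
    return resp.usage.completion_tokens / elapsed

print(tokens_per_second(
    "http://localhost:11434/v1",   # swap for http://localhost:1234/v1 (LM Studio), etc.
    "gemma4:26b",                  # hypothetical tag
    "Write a short poem about unified memory.",
))
```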
Usability and UX
- Ollama is praised for lowering the barrier: easy installs, simple CLI, no Hugging Face account, good default server functionality.
- Critics say its abstraction hides crucial details (model size, quantization, architecture optimizations), which misleads users about what they’re actually running; the sketch after this list shows one way to surface those details.
- Several note that OSS tools often lack strong UX; LM Studio is cited as an exception despite being closed.
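One way to answer the “what am I actually running?” question is Ollama’s own /api/show endpoint (the CLI equivalent is `ollama show <model>`). A minimal sketch, assuming a local server and the hypothetical `gemma4:26b` tag; field names reflect what recent Ollama versions return:

```python
# Ask Ollama what it is actually serving: family, parameter count, quantization.
import requests

resp = requests.post(
    "http://localhost:11434/api/show",
    json={"model": "gemma4:26b"},  # older Ollama versions expect "name" here instead
    timeout=30,
)
resp.raise_for_status()
details = resp.json().get("details", {})
for key in ("family", "parameter_size", "quantization_level", "format"):
    print(f"{key}: {details.get(key)}")
```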
Gemma 4 Models: Early Impressions
- On an M4 Mac mini (24 GB), the ~10 GB Gemma 4 variant is described as fast and usable for small coding tasks; the ~20 GB variant works but is sluggish and RAM-heavy (a rough memory budget is sketched after this list).
- Multiple reports: Gemma 4 is strong at tool use and data extraction, but early agentic coding performance is underwhelming compared to specialized coding models like Qwen 3.5.
- Some warn that early negative impressions may come from broken tokenizers or quantizations; implementations are still rapidly evolving.
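A back-of-the-envelope memory budget that makes the “sluggish and RAM-heavy” report plausible; every number here is an illustrative assumption, not a measurement, and macOS also reserves part of unified memory for the system, which tightens the budget further:

```python
# Why a ~20 GB model feels tight on a 24 GB machine (illustrative assumptions only).
total_ram_gb   = 24
weights_gb     = 20   # quantized weights as reported in the thread
kv_cache_gb    = 2    # rough guess for a few thousand tokens of context
os_and_apps_gb = 4    # macOS, browser, editor, etc. (varies widely)

headroom_gb = total_ram_gb - (weights_gb + kv_cache_gb + os_and_apps_gb)
print(f"headroom: {headroom_gb} GB")  # negative headroom => swapping and sluggishness likely
```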
Tool Calling & Backend Reliability
- Many tool-calling failures were reported across LM Studio, llama.cpp, and Ollama (especially on launch day); a quick smoke test for whether a backend emits tool calls at all is sketched after this list.
- Consensus: new open-weight releases are often buggy for weeks; users should expect to update engines and quantizations frequently and file bug reports.
- Proxies and shims (e.g., “tricks” layers) are used to emulate missing capabilities or fix prompt templates.
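A minimal tool-calling smoke test against any OpenAI-compatible backend. It only checks whether the model/engine combination emits a structured tool call at all, which helps separate broken templates or quantizations from genuinely weak models; the base_url, model tag, and tool definition are placeholders:

```python
# Does this backend emit a tool call at all? Useful before filing bug reports.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gemma4:26b",  # hypothetical tag
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,
)
msg = resp.choices[0].message
if msg.tool_calls:
    call = msg.tool_calls[0]
    print("tool call emitted:", call.function.name, call.function.arguments)
else:
    print("no tool call; model answered directly:", msg.content)
```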
Local vs Cloud Models & Expectations vs Claude
- Several say open models are useful for moderate tasks and privacy-sensitive workflows, but not yet close to Claude Sonnet/Opus or top proprietary coders for complex projects.
- Advice for people considering hardware purchases: first try hosted versions (e.g., via aggregators) to understand capabilities and limitations; the same client code can later be pointed at a local server, as sketched after this list.
- Some users experiment with open models for “agentic” coding but still fall back to Claude or other proprietary models for hard problems.
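One reason the “try hosted first” advice is low-cost: an OpenAI-compatible client only needs a different base_url and key to switch between a hosted aggregator and a local server. The endpoints, model names, and environment variable below are placeholders/assumptions:

```python
# Same client code for hosted aggregator vs local Ollama; only the endpoint changes.
import os
from openai import OpenAI

USE_LOCAL = os.environ.get("USE_LOCAL", "0") == "1"

client = OpenAI(
    base_url="http://localhost:11434/v1" if USE_LOCAL else "https://openrouter.ai/api/v1",
    api_key="not-needed" if USE_LOCAL else os.environ["OPENROUTER_API_KEY"],
)
model = "gemma4:26b" if USE_LOCAL else "google/gemma-4-26b-it"  # both tags hypothetical

resp = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Refactor this function to avoid the nested loop."}],
)
print(resp.choices[0].message.content)
```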
Hardware and Engine Notes (Macs, GPUs, MLX)
- Users report running Gemma 4 26B on MacBook Air and Mac mini machines with 24–32 GB of RAM; throughput in the ~20–40 tok/s range is seen as borderline for agents but fine for chat (see the arithmetic after this list).
- Unified memory on Apple Silicon avoids separate CPU/GPU VRAM constraints but doesn’t eliminate latency concerns for large models.
- MLX/oMLX support for Gemma 4 is emerging; current builds support basic chat, with partial or in-progress support for special tokens and tool calling.
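Rough arithmetic behind the “borderline for agents, fine for chat” judgment; the token counts are illustrative assumptions, and real agent turns are slowed further by prompt processing over long contexts:

```python
# Turnaround time at the reported generation speeds (illustrative numbers only).
def turn_seconds(output_tokens: int, tok_per_s: float) -> float:
    return output_tokens / tok_per_s

# A chat reply of ~300 tokens vs a multi-step agent turn emitting ~2,000 tokens.
for label, tokens in [("chat reply", 300), ("agent turn", 2000)]:
    for rate in (20, 40):
        print(f"{label:10s} at {rate} tok/s: {turn_seconds(tokens, rate):5.0f} s")
```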