2026-06-12

How to set up a local coding agent on macOS

Video and demo context

Some readers couldn’t see the embedded video; direct and full links were shared.
The demo is real-time and shows a “usable” speed, but one clip is cut before the model responds, so it’s not a perfect latency benchmark.

Tooling: llama.cpp vs Ollama vs LM Studio vs oMLX

Several commenters argue you don’t need huggingface-cli: llama.cpp’s -hf flag (and -hfd for a draft model) can fetch models directly from Hugging Face, with the LLAMA_CACHE environment variable controlling where they are stored.
Ollama is seen by some as convenient (GUI, easy integration with agents), by others as slower and unnecessary since it wraps llama.cpp and adds overhead; one link warns against using it.
LM Studio is praised for ease of use, good Metal performance, and a nice UI, but some prefer pure open-source stacks and direct CLI control.
oMLX is highlighted as making MLX-based servers easy, doing caching and model selection, and integrating with various agent harnesses and UIs; others report mixed performance vs GGUF/llama.cpp.
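
To make the llama.cpp point above concrete, here is a minimal sketch of launching llama-server with the -hf and -hfd flags and a custom LLAMA_CACHE. The repository names and cache path are placeholders rather than recommendations, the helper functions are hypothetical, and it assumes llama-server is already on your PATH.

```python
import os
import shlex


def llama_server_command(model_repo, draft_repo=None, port=8080):
    """Build an argv list for llama-server that pulls models from Hugging Face."""
    argv = ["llama-server", "-hf", model_repo, "--port", str(port)]
    if draft_repo is not None:
        # -hfd names the draft model used for speculative decoding.
        argv += ["-hfd", draft_repo]
    return argv


def llama_env(cache_dir):
    """Copy the current environment and point LLAMA_CACHE at cache_dir."""
    env = dict(os.environ)
    env["LLAMA_CACHE"] = os.path.expanduser(cache_dir)
    return env


if __name__ == "__main__":
    # Placeholder repository names; substitute a real GGUF repo.
    argv = llama_server_command(
        "example-org/example-model-GGUF",
        draft_repo="example-org/example-draft-GGUF",
    )
    env = llama_env("~/models/llama-cache")
    # Print the equivalent shell line instead of launching anything.
    print("LLAMA_CACHE=" + shlex.quote(env["LLAMA_CACHE"]), shlex.join(argv))
    # To actually start the server: subprocess.run(argv, env=env, check=True)
```

Building the argument list separately from launching keeps the sketch safe to run and makes it easy to drop the draft model when it does not help.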

Performance, MTP/QAT, and model choices

Experiences with MTP (multi-token prediction) are mixed: commenters report that it sometimes improves time to first token and is sometimes negligible or even harmful (e.g., markup issues, little gain on MoE models).
QAT (quantization-aware training) and speculative decoding can make dense ~27–31B models more practical to run locally; MoE vs dense tradeoffs are discussed.
Benchmarks that use ~128 tokens are criticized as misleading; longer prompts and context lengths are needed for meaningful numbers.
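
The benchmark criticism above is easier to see when prompt processing and generation are measured separately. The sketch below does that split from four numbers you would record yourself; the function name is hypothetical and the figures in the example are made up for illustration, not measurements.

```python
def split_throughput(prompt_tokens, generated_tokens, ttft_s, total_s):
    """Return (prefill tok/s, decode tok/s, end-to-end tok/s).

    ttft_s:  seconds until the first generated token arrives
             (used here as an approximation of prompt processing time).
    total_s: seconds for the whole request.
    """
    if generated_tokens < 2:
        raise ValueError("need at least two generated tokens")
    if ttft_s <= 0 or total_s <= ttft_s:
        raise ValueError("need 0 < ttft_s < total_s")
    prefill = prompt_tokens / ttft_s
    # The first token arrives at ttft_s, so the rest are spread over the remainder.
    decode = (generated_tokens - 1) / (total_s - ttft_s)
    end_to_end = generated_tokens / total_s
    return prefill, decode, end_to_end


if __name__ == "__main__":
    # Illustrative numbers only: a short prompt versus a long agent-style context.
    for label, args in [
        ("short prompt", (128, 129, 0.5, 4.5)),
        ("long context", (16000, 129, 40.0, 48.0)),
    ]:
        prefill, decode, overall = split_throughput(*args)
        print(f"{label}: prefill {prefill:.0f} tok/s, "
              f"decode {decode:.1f} tok/s, end-to-end {overall:.2f} tok/s")
```

With a long prompt, most of the wall-clock time goes to prompt processing, which a short-prompt benchmark never exercises.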

Hardware and feasibility on Macs

Experiences range from 16GB Airs up to 128GB M4/M5 Max machines.
Consensus: large dense models need at least 48–64GB of memory to run comfortably; smaller or more aggressively quantized models will run on 16–24GB but may be slow and CPU-bound.
Some consider local models “toys” versus cloud in both speed and capability; others find them fully “good enough” for offline coding and experimentation.
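
The memory figures above follow from simple arithmetic: weights take roughly parameters times bits per weight, and the KV cache grows with context length. The sketch below is a back-of-the-envelope estimate only; the function names are hypothetical, the layer and head counts in the example are invented rather than taken from any specific model, and real usage adds runtime overhead plus whatever macOS and other apps need.

```python
def weight_gb(params_billion, bits_per_weight):
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Approximate KV cache size in GB for one sequence.

    The leading factor of 2 accounts for storing both keys and values.
    """
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value
    return total_bytes / 1e9


if __name__ == "__main__":
    # Hypothetical 27B dense model at about 4 bits per weight.
    weights = weight_gb(27, 4)
    # Invented architecture numbers, 32k-token context, 16-bit cache values.
    cache = kv_cache_gb(n_layers=60, n_kv_heads=8, head_dim=128, context_tokens=32768)
    print(f"weights ~{weights:.1f} GB, KV cache ~{cache:.1f} GB, "
          f"total ~{weights + cache:.1f} GB before overhead")
```

An estimate like this helps explain why a model that technically loads on a 16–24GB machine can still leave too little headroom for a long agent context.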

Quality vs speed and use cases

Several complain that articles focus on tokens/sec and ignore answer quality.
Others respond that base model quality is benchmarked elsewhere; the article is about running the model, so speed is the main variable.
General agreement that local models are more like “super autocomplete” than full autonomous agents, but useful for boilerplate, explanation, offline help, and privacy-sensitive work.

Local vs cloud, cost, and philosophy

Economic skeptics argue extra hardware cost rarely beats simply paying for cloud APIs.
Enthusiasts value privacy, offline reliability, learning how systems work, avoiding cloud dependence, and being able to swap models/agent harnesses.
There’s tension between “let AI be your subordinate to get work done” and concerns about over-reliance and “slop” production.
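
The economic argument above comes down to a break-even calculation. The sketch below shows that arithmetic under stated assumptions: the function name is hypothetical, all inputs are numbers you supply, and the example values are placeholders rather than real prices. It ignores non-monetary factors such as privacy and offline use.

```python
def breakeven_months(extra_hardware_cost, monthly_api_spend, monthly_power_cost=0.0):
    """Months until extra hardware spend is recovered by avoided API fees.

    Returns None when running locally saves nothing per month,
    in which case the hardware never pays for itself on cost alone.
    """
    monthly_saving = monthly_api_spend - monthly_power_cost
    if monthly_saving <= 0:
        return None
    return extra_hardware_cost / monthly_saving


if __name__ == "__main__":
    # Placeholder figures in an arbitrary currency.
    months = breakeven_months(extra_hardware_cost=2000, monthly_api_spend=100)
    print("break-even:", "never" if months is None else f"{months:.0f} months")
```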
