How to set up a local coding agent on macOS

Video and demo context

  • Some readers couldn’t see the embedded video; direct and full links were shared.
  • The demo is real-time and shows a “usable” speed, but one clip is cut before the model responds, so it’s not a perfect latency benchmark.

Tooling: llama.cpp vs Ollama vs LM Studio vs oMLX

  • Several commenters argue that huggingface-cli is unnecessary: llama.cpp’s -hf flag (and -hfd for a draft model) can fetch models directly, with the LLAMA_CACHE environment variable controlling where they are stored.
  • Ollama divides opinion: some find it convenient (GUI, easy integration with agents), while others call it slower and unnecessary because it wraps llama.cpp and adds overhead; one linked post warns against using it.
  • LM Studio is praised for ease of use, good Metal performance, and a nice UI, but some prefer pure open-source stacks and direct CLI control.
  • oMLX is highlighted as making MLX-based servers easy, doing caching and model selection, and integrating with various agent harnesses and UIs; others report mixed performance vs GGUF/llama.cpp.
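
As a rough illustration of the llama.cpp point above, the sketch below assembles a llama-server command line that pulls a model with -hf, optionally a draft model with -hfd, and sets LLAMA_CACHE for the download location. The function name, the repository names, and the cache path are hypothetical placeholders, not anything recommended in the thread; the command is only built and printed here, not run.

```python
import os
import shlex


def build_llama_server_command(repo, draft_repo=None, cache_dir=None, port=8080):
    """Return (argv, env) for launching llama-server with a Hugging Face model."""
    argv = ["llama-server", "-hf", repo, "--port", str(port)]
    if draft_repo is not None:
        # -hfd names a second repository to fetch as the draft model.
        argv += ["-hfd", draft_repo]

    env = dict(os.environ)
    if cache_dir is not None:
        # LLAMA_CACHE overrides where downloaded model files are stored.
        env["LLAMA_CACHE"] = os.path.expanduser(cache_dir)
    return argv, env


if __name__ == "__main__":
    # Hypothetical repository names and cache path, for illustration only.
    argv, env = build_llama_server_command(
        "example-org/example-model-GGUF",
        draft_repo="example-org/example-draft-GGUF",
        cache_dir="~/models/llama-cache",
    )
    print("LLAMA_CACHE=" + shlex.quote(env["LLAMA_CACHE"]), shlex.join(argv))
    # To actually launch the server: subprocess.run(argv, env=env, check=True)
```

Keeping the command as a list plus an environment mapping makes it easy to hand to subprocess.run later without shell-quoting problems.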

Performance, MTP/QAT, and model choices

  • Mixed experiences with MTP: sometimes improves time to first token, sometimes negligible or even harmful (e.g., markup issues, little gain on MoE models).
  • QAT and speculative decoding can make dense ~27–31B models more acceptable; MoE vs dense tradeoffs are discussed.
  • Benchmarks that use ~128 tokens are criticized as misleading; longer prompts and context lengths are needed for meaningful numbers.
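
The complaint about ~128-token benchmarks can be made concrete with some simple arithmetic. The sketch below assumes a two-phase model of a request (prompt processing at one rate, generation at another) and uses made-up speeds of 400 and 20 tokens/sec; neither figure comes from the thread. It also holds the generation rate constant, whereas in practice it tends to drop as the context grows, so the long-prompt case is if anything optimistic.

```python
def end_to_end_seconds(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Total request time: prompt processing followed by token generation."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


def effective_tps(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Output tokens per second as experienced over the whole request."""
    total = end_to_end_seconds(prompt_tokens, output_tokens, prefill_tps, decode_tps)
    return output_tokens / total


if __name__ == "__main__":
    # Hypothetical speeds, chosen only to show the shape of the effect.
    PREFILL_TPS = 400.0
    DECODE_TPS = 20.0
    for prompt_tokens in (128, 4_000, 32_000):
        tps = effective_tps(prompt_tokens, 128, PREFILL_TPS, DECODE_TPS)
        print(f"prompt={prompt_tokens:>6} tokens -> {tps:.2f} effective tokens/sec")
```

With a 128-token prompt the prompt-processing term is almost invisible, which is why such benchmarks say little about agent workloads that routinely send tens of thousands of tokens of context.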

Hardware and feasibility on Macs

  • Experiences range from 16GB Airs up to 128GB M4/M5 Max machines.
  • Consensus: large dense models need ≥48–64GB for comfort; smaller or more aggressively quantized models will run on 16–24GB but may be slow and CPU-bound.
  • Some consider local models “toys” versus cloud in both speed and capability; others find them fully “good enough” for offline coding and experimentation.
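
The memory thresholds in the consensus bullet follow roughly from the size of the weights. A back-of-the-envelope sketch, assuming weights take about (parameters × bits per weight ÷ 8) bytes and reserving a hypothetical 8 GB for the OS, KV cache, and other apps; real quantization formats often use somewhat more than their nominal bit width, so treat the results as lower bounds.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate weight size in GB: 1e9 params * bits / 8 bytes per billion params."""
    return params_billion * bits_per_weight / 8


def fits_in_ram(params_billion, bits_per_weight, ram_gb, reserved_gb=8.0):
    """True if the weights plus a reserved margin fit in unified memory.

    reserved_gb is an assumed allowance for the OS, KV cache, and other
    processes; it is a placeholder, not a measured value.
    """
    return weight_memory_gb(params_billion, bits_per_weight) + reserved_gb <= ram_gb


if __name__ == "__main__":
    for params, bits in ((27, 4), (27, 8), (31, 8)):
        size = weight_memory_gb(params, bits)
        verdicts = ", ".join(
            f"{ram} GB: {'ok' if fits_in_ram(params, bits, ram) else 'no'}"
            for ram in (16, 24, 48, 64)
        )
        print(f"{params}B @ {bits}-bit ~ {size:.1f} GB weights | {verdicts}")
```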

Quality vs speed and use cases

  • Several complain that articles focus on tokens/sec and ignore answer quality.
  • Others respond that base model quality is benchmarked elsewhere; the article is about running the model, so speed is the main variable.
  • General agreement that local models are more like “super autocomplete” than full autonomous agents, but useful for boilerplate, explanation, offline help, and privacy-sensitive work.

Local vs cloud, cost, and philosophy

  • Economic skeptics argue extra hardware cost rarely beats simply paying for cloud APIs.
  • Enthusiasts value privacy, offline reliability, learning how systems work, avoiding cloud dependence, and being able to swap models/agent harnesses.
  • There’s tension between “let AI be your subordinate to get work done” and concerns about over-reliance and “slop” production.
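
The economic skeptics’ argument in the first bullet of this section comes down to a break-even calculation. A minimal sketch, assuming the only costs are the extra money spent on a higher-memory Mac and a flat monthly cloud bill; the dollar amounts are hypothetical, and electricity, resale value, and differences in model capability are ignored.

```python
def break_even_months(extra_hardware_cost, monthly_cloud_cost):
    """Months of cloud spending needed to equal the extra hardware outlay."""
    if monthly_cloud_cost <= 0:
        raise ValueError("monthly_cloud_cost must be positive")
    return extra_hardware_cost / monthly_cloud_cost


if __name__ == "__main__":
    # Hypothetical figures: extra spent on RAM upgrades vs. monthly API spend.
    for extra, monthly in ((2000, 20), (2000, 100), (2000, 200)):
        months = break_even_months(extra, monthly)
        print(f"${extra} extra hardware vs ${monthly}/month cloud: {months:.0f} months")
```

The lower the monthly cloud spend, the longer the payback period, which is the skeptics’ point; the enthusiasts’ reasons in the next bullet are mostly ones this calculation does not capture.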