Running local models on an M4 with 24GB memory
Hardware & Config Debates
- Initial confusion over “M4 with 24GB” is resolved: it refers to Apple M4 Macs (Air/Pro/Mini), not Nvidia Tesla GPUs.
- People share configs from 16GB Airs up to 128GB M5 Max MacBook Pros and 128GB desktops with GPUs.
- Strong sentiment that RAM capacity often matters more than raw CPU/GPU speed for local LLMs; 32–64GB is “usable,” and 96–128GB is considered a sweet spot for serious work.
- Some argue high-end Macs are poor value vs. cheaper desktops with used GPUs and lots of RAM/VRAM; others prefer paying once for a powerful laptop over ongoing cloud fees.
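The RAM-first argument above comes down to simple arithmetic: weights take roughly (parameter count × bits per weight ÷ 8) bytes. A minimal sketch, assuming a hypothetical 20% overhead for KV cache and runtime buffers, and assuming only about 75% of unified memory is usable for the model. Both figures are illustrative guesses, not values from the thread:

```python
def model_memory_gb(params_billion: float, bits_per_weight: float,
                    overhead_fraction: float = 0.2) -> float:
    """Rough footprint in GB (10^9 bytes): quantized weights plus overhead.

    overhead_fraction is an assumed allowance for KV cache and buffers.
    """
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * (1 + overhead_fraction) / 1e9


def fits(params_billion: float, bits_per_weight: float, ram_gb: float,
         usable_fraction: float = 0.75) -> bool:
    """True if the estimated footprint fits in the assumed usable share of RAM."""
    budget_gb = ram_gb * usable_fraction
    return model_memory_gb(params_billion, bits_per_weight) <= budget_gb


# Compare a few model sizes and quantization widths against a 24GB machine.
for params, bits in [(9, 8), (27, 4), (27, 8), (35, 4)]:
    size = model_memory_gb(params, bits)
    print(f"{params}B @ {bits}-bit: ~{size:.1f} GB, fits in 24GB: {fits(params, bits, 24)}")
```

Under these assumptions a 27B model at 4 bits lands around 16GB, which is tight but feasible on a 24GB machine, while the same model at 8 bits does not fit.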
Model Choices & Performance
- Qwen 3.6/3.7 and Gemma 4 are repeatedly cited as current “good enough” local models, especially 27B–35B variants; 9B models are often described as weak for serious coding.
- 4–14B models are said to fall between GPT‑3.5 and GPT‑4o‑mini in capability, and remain notably behind current frontier models.
- Reported speeds on Apple silicon for 20–31B models cluster around ~7–12 tokens/s with 8‑bit or Q4/Q5 quants; MoE models can have decent tokens/s but poor time‑to‑first‑token.
- Benchmarks and anecdotes show Gemma 4 31B and Qwen 3.6 27B/35B can sometimes match older frontier models on constrained tasks, but not consistently.
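The tokens/s versus time‑to‑first‑token point can be made concrete: total response time is roughly TTFT plus output length divided by generation speed. A rough sketch with made‑up numbers (the 10 tokens/s dense figure sits inside the range reported above; the MoE figures are purely hypothetical):

```python
def response_seconds(ttft_s: float, tokens_per_s: float, output_tokens: int) -> float:
    """Approximate wall-clock time: wait for the first token, then stream the rest."""
    return ttft_s + output_tokens / tokens_per_s


def crossover_tokens(ttft_a: float, tps_a: float, ttft_b: float, tps_b: float):
    """Output length at which model B catches up with model A.

    Returns None unless B starts slower (higher TTFT) but streams faster.
    """
    if ttft_b <= ttft_a or tps_b <= tps_a:
        return None
    return (ttft_b - ttft_a) / (1 / tps_a - 1 / tps_b)


# Hypothetical setups: a dense model with quick start, an MoE with slow start.
dense = {"ttft_s": 2.0, "tokens_per_s": 10.0}
moe = {"ttft_s": 20.0, "tokens_per_s": 25.0}

for n in (50, 300, 1000):
    print(n, response_seconds(output_tokens=n, **dense), response_seconds(output_tokens=n, **moe))
```

With these illustrative numbers the faster‑streaming model only pulls ahead after about 300 output tokens, which is one way to read the complaint that short interactive replies feel slow on MoE models despite good tokens/s.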
Local vs Cloud Tradeoffs
- Several participants stress that local models are “nowhere near” Claude Opus / ChatGPT‑5.x for complex coding, long‑context reasoning, and reliability.
- Others report local models solving nontrivial tasks (debugging, protocol reverse‑engineering, security analysis) and being “good enough” for much daily work.
- Economic arguments: the price of a multi‑thousand‑dollar laptop is weighed against decades of a $20/month subscription; for some, local only makes sense when offline use, privacy, or latency is critical.
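The economic argument above is a break‑even calculation. A minimal sketch, assuming a hypothetical $4,000 laptop and ignoring electricity, resale value, and subscription price changes; only the $20/month figure comes from the thread:

```python
def breakeven_months(hardware_cost: float, monthly_fee: float,
                     baseline_cost: float = 0.0) -> float:
    """Months of subscription that cost as much as the extra hardware spend.

    baseline_cost is the (assumed) price of the machine you would buy anyway,
    so only the premium paid for local-LLM capability is counted.
    """
    if monthly_fee <= 0:
        raise ValueError("monthly_fee must be positive")
    return (hardware_cost - baseline_cost) / monthly_fee


months = breakeven_months(4000, 20)
print(f"Full price: {months:.0f} months (~{months / 12:.1f} years)")

months = breakeven_months(4000, 20, baseline_cost=1500)
print(f"Premium only: {months:.0f} months (~{months / 12:.1f} years)")
```

At the hypothetical $4,000 price this gives 200 months, or roughly 16.7 years, before the hardware pays for itself against a single $20/month plan; counting only the premium over a machine you would have bought anyway shortens that considerably.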
Use Cases & Workflows
- Effective use often involves interactive, step‑by‑step workflows, tight prompts, and frequent testing rather than long autonomous runs.
- Local models are seen as strong for boilerplate coding, small refactors, office drudgery (email, translation, simple docs); weaker for large projects and high‑risk legal/finance tasks.
- Some propose hybrid flows: frontier models for research/planning, local models for execution and editing.
Optimizations, Tooling & Meta
- New inference tricks (MTP, i.e. multi‑token prediction, plus turboquant, Dflash, rotorquant) and engines or front ends (MLX, llama.cpp, LM Studio, Ollama, browser‑based agents) are actively explored; people believe there is still headroom for speed gains.
- There’s visible “bipolar” sentiment: excitement about technical progress and decentralization, alongside concern about overhyping local models and the impact of LLMs on software craftsmanship.