Running local models on an M4 with 24GB memory

Hardware & Config Debates

  • Initial confusion over “M4 with 24GB” is resolved: it refers to Apple M4 Macs (Air/Pro/Mini), not Nvidia Tesla GPUs.
  • People share configs ranging from 16GB Airs up to 128GB M5 Max MacBook Pros and desktops with 128GB of RAM plus discrete GPUs.
  • Strong sentiment that RAM capacity often matters more than raw CPU/GPU speed for local LLMs; 32–64GB is called “usable,” and 96–128GB is considered the sweet spot for serious work.
  • Some argue high-end Macs are poor value vs. cheaper desktops with used GPUs and lots of RAM/VRAM; others prefer paying once for a powerful laptop over ongoing cloud fees.
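
The RAM argument above comes down to simple arithmetic, sketched here in Python. This is a rough estimate, not a measurement: it counts only the weights (parameters × bits per weight / 8) and adds a flat allowance for KV cache and runtime. The `usable_fraction` and `overhead_gb` defaults are illustrative assumptions, not figures from the thread, and real Q4 quants usually store somewhat more than 4 bits per weight.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float, ram_gb: float,
         usable_fraction: float = 0.7, overhead_gb: float = 2.0) -> bool:
    """Rough check: do weights plus a flat overhead fit in the usable share of RAM?

    usable_fraction and overhead_gb are assumed placeholders; tune them
    for your own machine and context length.
    """
    budget_gb = ram_gb * usable_fraction
    return weights_gb(params_billion, bits_per_weight) + overhead_gb <= budget_gb


if __name__ == "__main__":
    print(weights_gb(27, 4))   # 13.5
    print(fits(27, 4, 24))     # True under the assumed defaults
    print(fits(27, 8, 24))     # False: 27 GB of weights alone exceeds 24 GB
```

Under these assumptions a 27B model at 4 bits is a tight fit on a 24GB machine and an 8‑bit version does not fit at all, which matches the thread's view that capacity is the first constraint.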

Model Choices & Performance

  • Qwen 3.6/3.7 and Gemma 4 are repeatedly cited as current “good enough” local models, especially 27B–35B variants; 9B models are often described as weak for serious coding.
  • Models in the 4–14B range are said to fall between GPT‑3.5 and GPT‑4o‑mini in quality, still notably behind current frontier models.
  • Reported speeds on Apple silicon for 20–31B models cluster around ~7–12 tokens/s with 8‑bit or Q4/Q5 quants; mixture‑of‑experts (MoE) models can have decent tokens/s but poor time‑to‑first‑token.
  • Benchmarks and anecdotes show Gemma 4 31B and Qwen 3.6 27B/35B can sometimes rival older frontier models on constrained tasks, but not consistently.
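
The reported speeds are roughly what a memory‑bandwidth bound predicts. The sketch below assumes decoding is bandwidth‑limited, so each generated token requires reading all active weights once; the 120 GB/s figure used in the example is an assumed bandwidth for a base M4, not a number from the thread. The result is an upper bound and ignores KV‑cache reads and compute. For a MoE model, only the active parameters go into the decode estimate, which is why tokens/s can look good while prompt processing (time‑to‑first‑token) stays slow.

```python
def decode_tokens_per_s(bandwidth_gb_s: float, active_params_billion: float,
                        bits_per_weight: float) -> float:
    """Upper bound on decode speed if every token reads all active weights once."""
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


def response_seconds(ttft_s: float, output_tokens: int, tokens_per_s: float) -> float:
    """Total wall time for one reply: time-to-first-token plus decode time."""
    return ttft_s + output_tokens / tokens_per_s


if __name__ == "__main__":
    # Dense 27B model at 4 bits on an assumed 120 GB/s machine.
    print(round(decode_tokens_per_s(120, 27, 4), 1))  # 8.9
    # A reply of 100 tokens at 10 tokens/s after a 10 s wait for the first token.
    print(response_seconds(10, 100, 10))              # 20.0
```

The second function shows why a high tokens/s figure alone can mislead: a long wait for the first token can dominate short replies.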

Local vs Cloud Tradeoffs

  • Several participants stress that local models are “nowhere near” Claude Opus / ChatGPT‑5.x for complex coding, long‑context reasoning, and reliability.
  • Others report local models solving nontrivial tasks (debugging, protocol reverse‑engineering, security analysis) and being “good enough” for much daily work.
  • Economic arguments: a multi‑thousand‑dollar laptop costs as much as decades of a $20/month subscription; for some, local only makes sense if offline use, privacy, or latency is critical.
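
The economic argument can be checked with a one‑line break‑even calculation. This sketch deliberately ignores electricity, resale value, and the quality gap between local and frontier models; the hardware prices are illustrative, and only the $20/month fee comes from the discussion.

```python
def breakeven_years(hardware_cost: float, monthly_fee: float) -> float:
    """Years of subscription payments that add up to the hardware price."""
    return hardware_cost / monthly_fee / 12


if __name__ == "__main__":
    print(breakeven_years(3000, 20))            # 12.5
    print(round(breakeven_years(5000, 20), 1))  # 20.8
```

At $20/month, a $3,000 laptop corresponds to 12.5 years of subscription and a $5,000 one to about 21 years, which is the scale the “decades” argument refers to.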

Use Cases & Workflows

  • Effective use often involves interactive, step‑by‑step workflows, tight prompts, and frequent testing rather than long autonomous runs.
  • Local models are seen as strong for boilerplate coding, small refactors, office drudgery (email, translation, simple docs); weaker for large projects and high‑risk legal/finance tasks.
  • Some propose hybrid flows: frontier models for research/planning, local models for execution and editing.
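
The hybrid flow mentioned above can be sketched as a small planner/executor loop. Everything here is hypothetical: `hybrid_run`, the prompt wording, and the `Model` type are illustrative names, and each model is just a function from prompt to text, so any real client (cloud or local) would need to be wrapped to match. Running one step at a time also reflects the step‑by‑step style participants recommend.

```python
from typing import Callable, List

# Hypothetical interface: a model is any function from prompt text to reply text.
Model = Callable[[str], str]


def hybrid_run(task: str, planner: Model, executor: Model) -> List[str]:
    """Ask the planner (e.g. a frontier model) for steps, then run each
    step separately on the executor (e.g. a local model)."""
    plan = planner(f"Break this task into short numbered steps:\n{task}")
    steps = [line.strip() for line in plan.splitlines() if line.strip()]
    return [executor(f"Task: {task}\nDo only this step: {step}") for step in steps]


if __name__ == "__main__":
    # Stub models so the sketch runs without any network or inference engine.
    def fake_planner(prompt: str) -> str:
        return "1. write a failing test\n2. fix the bug\n"

    def fake_executor(prompt: str) -> str:
        return "done: " + prompt.splitlines()[-1]

    for result in hybrid_run("fix the parser", fake_planner, fake_executor):
        print(result)
```

In practice a person would review the output of each step before moving on, rather than letting the loop run unattended.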

Optimizations, Tooling & Meta

  • New inference tricks (MTP, turboquant, Dflash, rotorquant) and engines and tools (MLX, llama.cpp, LM Studio, Ollama, browser‑based agents) are actively explored; people believe there is still headroom for speed gains.
  • There’s visible “bipolar” sentiment: excitement about technical progress and decentralization, alongside concern about overhyping local models and the impact of LLMs on software craftsmanship.
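
Comparing engines or inference tricks fairly means measuring time‑to‑first‑token and decode speed the same way each time. The sketch below is engine‑agnostic: it times any iterable of tokens and calls no real library API, so wiring it to a particular engine's streaming output is left as an assumption. `measure_stream` and the injectable `clock` parameter are hypothetical names chosen for this example.

```python
import time
from typing import Callable, Dict, Iterable


def measure_stream(tokens: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> Dict[str, float]:
    """Time a token stream: wait before the first token, then decode rate."""
    start = clock()
    first = None
    last = start
    count = 0
    for _ in tokens:
        now = clock()
        if first is None:
            first = now  # arrival time of the first token
        last = now
        count += 1
    if first is None:
        raise ValueError("stream produced no tokens")
    decode_time = last - first
    # Decode rate counts only the tokens that arrived after the first one.
    rate = (count - 1) / decode_time if decode_time > 0 else float("nan")
    return {"ttft_s": first - start, "tokens": count, "decode_tokens_per_s": rate}


if __name__ == "__main__":
    def slow_stream():
        time.sleep(0.2)          # simulated prompt processing
        for word in ["local", "models", "are", "fun"]:
            time.sleep(0.05)     # simulated per-token decode time
            yield word

    print(measure_stream(slow_stream()))
```

Reporting the two numbers separately keeps a fast decoder with slow prompt processing from looking better than it feels in interactive use.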