2026-05-10

Running local models on an M4 with 24GB memory

Hardware & Config Debates

Initial confusion over “M4 with 24GB” is resolved: it refers to Apple M4 Macs (Air/Pro/Mini), not Nvidia Tesla GPUs.
People share configs from 16GB Airs up to 128GB M5 Max MacBook Pros and 128GB desktops with GPUs.
There is strong sentiment that RAM capacity often matters more than raw CPU/GPU speed for local LLMs; 32–64GB is described as “usable,” and 96–128GB is considered a sweet spot for serious work.
Some argue high-end Macs are poor value vs. cheaper desktops with used GPUs and lots of RAM/VRAM; others prefer paying once for a powerful laptop over ongoing cloud fees.
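The point about RAM capacity can be made concrete with a rough estimate of how much memory a model's weights need at a given quantization. This is a minimal sketch, not a measurement: the function names, the 2GB overhead for KV cache and runtime, and the 70% usable-memory fraction are all illustrative assumptions, and real quant formats usually store slightly more than their nominal bits per weight.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bits = params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


def fits_in_ram(
    params_billion: float,
    bits_per_weight: float,
    ram_gb: float,
    usable_fraction: float = 0.7,  # assumption: share of RAM the model may use
    overhead_gb: float = 2.0,      # assumption: KV cache, runtime, buffers
) -> bool:
    """Return True if weights plus overhead fit in the usable share of RAM."""
    needed = weight_memory_gb(params_billion, bits_per_weight) + overhead_gb
    return needed <= ram_gb * usable_fraction


# A 27B model on a 24GB machine, at nominal 4-bit and 8-bit quantization.
q4_weights = weight_memory_gb(27, 4)   # 13.5 GB of weights
q4_fits = fits_in_ram(27, 4, 24)       # fits under these assumptions
q8_fits = fits_in_ram(27, 8, 24)       # 27 GB of weights alone, does not fit
```

Under these assumptions a 27B model only fits on a 24GB machine at around 4 bits per weight, which is consistent with the thread's view that memory is the first constraint people hit.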

Model Choices & Performance

Qwen 3.6/3.7 and Gemma 4 are repeatedly cited as current “good enough” local models, especially 27B–35B variants; 9B models are often described as weak for serious coding.
Models in the 4–14B range are said to fall between GPT‑3.5 and GPT‑4o‑mini in capability, which still leaves them notably behind current frontier models.
Reported speeds on Apple silicon for 20–31B models cluster around ~7–12 tokens/s with 8‑bit or Q4/Q5 quants; MoE models can have decent tokens/s but poor time‑to‑first‑token.
Benchmarks and anecdotes show Gemma 4 31B and Qwen 3.6 27B/35B can sometimes rival older frontier models on constrained tasks, but not consistently.
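The reported speeds can be sanity-checked with a common rule of thumb: for a dense model, decoding is roughly memory-bandwidth bound, so tokens/s is at most about bandwidth divided by the bytes of weights read per token. The sketch below assumes a 120 GB/s memory bandwidth figure for a base M4 and a 13.5GB weight file; both numbers, and the function names, are assumptions for illustration. The second function shows why a poor time‑to‑first‑token can dominate short replies even when tokens/s looks fine.

```python
def decode_tokens_per_s(bandwidth_gb_s: float, active_weights_gb: float) -> float:
    """Rough upper bound on decode speed for a bandwidth-bound model.

    active_weights_gb is the weight data read per generated token: the full
    model for a dense network, only the active experts for an MoE model.
    """
    return bandwidth_gb_s / active_weights_gb


def response_seconds(ttft_s: float, output_tokens: int, tokens_per_s: float) -> float:
    """Total wall-clock time: time-to-first-token plus steady-state decoding."""
    return ttft_s + output_tokens / tokens_per_s


# Assumed figures: 120 GB/s bandwidth, 13.5 GB of weights (27B at ~4 bits).
upper_bound = decode_tokens_per_s(120.0, 13.5)

# Same decode speed, different time-to-first-token, for a 100-token reply.
fast_start = response_seconds(ttft_s=2.0, output_tokens=100, tokens_per_s=10.0)
slow_start = response_seconds(ttft_s=30.0, output_tokens=100, tokens_per_s=10.0)
```

With these assumed inputs the bound lands just under 9 tokens/s, inside the ~7–12 tokens/s range people reported; real throughput also depends on the engine, context length, and quant format.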

Local vs Cloud Tradeoffs

Several participants stress that local models are “nowhere near” Claude Opus / ChatGPT‑5.x for complex coding, long‑context reasoning, and reliability.
Others report local models solving nontrivial tasks (debugging, protocol reverse‑engineering, security analysis) and being “good enough” for much daily work.
Economic arguments compare a multi‑thousand‑dollar laptop with decades of a $20/month subscription; for some, local only makes sense if offline use, privacy, or latency is critical.
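The laptop-versus-subscription argument is simple arithmetic, sketched below. The $4,000 hardware price is a hypothetical figure chosen for illustration; only the $20/month fee comes from the discussion. The calculation ignores electricity, resale value, and price changes.

```python
def break_even_months(hardware_cost: float, monthly_fee: float) -> float:
    """Months of subscription fees that add up to the hardware price."""
    if monthly_fee <= 0:
        raise ValueError("monthly_fee must be positive")
    return hardware_cost / monthly_fee


# Hypothetical $4,000 laptop against a $20/month subscription.
months = break_even_months(4000, 20)
years = months / 12
```

For this hypothetical price the break-even point is 200 months, a little under 17 years, which is why several participants only see local hardware paying off when non-monetary factors matter.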

Use Cases & Workflows

Effective use often involves interactive, step‑by‑step workflows, tight prompts, and frequent testing rather than long autonomous runs.
Local models are seen as strong for boilerplate coding, small refactors, and office drudgery (email, translation, simple docs), and weaker for large projects and high‑risk legal/finance tasks.
Some propose hybrid flows: frontier models for research/planning, local models for execution and editing.
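The hybrid flow some participants describe can be pictured as a small routing step in front of two backends. This is only a sketch of the idea: the task labels, the `pick_backend` function, and the backend names are hypothetical and do not correspond to any tool mentioned in the thread.

```python
# Hypothetical task labels, taken from the split described in the discussion.
FRONTIER_TASKS = {"research", "planning"}
LOCAL_TASKS = {"execution", "editing"}


def pick_backend(task_kind: str) -> str:
    """Choose which kind of model should handle a task.

    Returns "frontier" for research/planning and "local" for
    execution/editing; unknown task kinds are rejected rather than guessed.
    """
    kind = task_kind.strip().lower()
    if kind in FRONTIER_TASKS:
        return "frontier"
    if kind in LOCAL_TASKS:
        return "local"
    raise ValueError(f"unknown task kind: {task_kind!r}")


# Route a short plan-then-implement workflow.
workflow = ["planning", "execution", "editing"]
routes = [pick_backend(step) for step in workflow]
```

Rejecting unknown task kinds keeps the split explicit; a real setup might instead default to one backend or ask the user.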

Optimizations, Tooling & Meta

New inference tricks (MTP, turboquant, Dflash, rotorquant) and engines and front ends (MLX, llama.cpp, LM Studio, Ollama, browser‑based agents) are actively explored; people believe there is still headroom for speed improvements.
There’s visible “bipolar” sentiment: excitement about technical progress and decentralization, alongside concern about overhyping local models and the impact of LLMs on software craftsmanship.
