Experimenting with Local LLMs on macOS
In-browser local LLMs and sandboxing
- Multiple projects already run LLMs fully in the browser via WebGPU/WASM (MLC's WebLLM, transformers.js demos, WebGPU-powered Spaces, wllama, WebNN samples).
- A key UX desire is a pure HTML page with a “Select model from disk” button that loads local files without uploading anything; one commenter demonstrates the pattern with transformers.js plus a local ONNX model folder (see the sketch after this list).
- There’s frustration that WebGPU isn’t enabled by default on Linux; some want WebGL-based solutions or non-GPU WASM fallbacks.
- Others argue browser sandboxing is overrated compared to unprivileged containers/VMs, which can also isolate GPU workloads.
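A minimal sketch of the transformers.js pattern mentioned above, assuming the ONNX model folder is reachable at a path the page can fetch; the `/models/` path and the `local-onnx-model` folder name are placeholders, and a true “pick a folder from disk” button needs extra plumbing (e.g. feeding the picked files into the library's browser cache):

```ts
// In-browser text generation with transformers.js. Nothing is uploaded;
// inference runs entirely client-side (WASM or WebGPU, depending on build).
import { pipeline, env } from '@xenova/transformers';

// Keep everything local: never fetch weights from the Hugging Face Hub.
env.allowRemoteModels = false;
env.allowLocalModels = true;
env.localModelPath = '/models/'; // model names resolve against this path

async function run(prompt: string): Promise<string> {
  // 'local-onnx-model' is a placeholder folder containing config.json,
  // tokenizer files, and the exported ONNX weights.
  const generator = await pipeline('text-generation', 'local-onnx-model');
  const output: any = await generator(prompt, { max_new_tokens: 64 });
  return output[0].generated_text;
}

run('Write a haiku about unified memory.').then(console.log);
```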
macOS local LLM tooling and interfaces
- Popular tools: LM Studio (with its OpenAI-compatible server), Ollama, On-Device AI, Pico AI Server + Witsy, Osaurus, llamafile, DEVONthink's AI features, Open WebUI, and various Electron-based UIs (see the request sketch after this list).
- Some emphasize “no-install”, browser-only experiences; others accept native apps or Docker as long as they provide a simple chat UI plus a model dropdown.
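Most of the native tools above converge on the same OpenAI-compatible HTTP API, so a thin chat client reduces to a single request. A minimal sketch, assuming LM Studio's default local server address (http://localhost:1234/v1; Ollama exposes a compatible endpoint on port 11434) and a placeholder model id:

```ts
// Minimal chat request against a local OpenAI-compatible server
// (LM Studio, Ollama, llama.cpp's server, etc.).
interface ChatMessage { role: 'system' | 'user' | 'assistant'; content: string; }

async function chat(messages: ChatMessage[]): Promise<string> {
  const res = await fetch('http://localhost:1234/v1/chat/completions', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'qwen3-30b-a3b', // placeholder: whatever model the server has loaded
      messages,
      temperature: 0.2,
    }),
  });
  if (!res.ok) throw new Error(`Local server returned ${res.status}`);
  const data = await res.json();
  return data.choices[0].message.content;
}

chat([{ role: 'user', content: 'Summarize why unified memory helps local inference.' }])
  .then(console.log);
```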
Hardware limits, Apple Silicon, and NPUs
- Rule of thumb: 12–20B parameters is near the practical upper bound on 16 GB of RAM; some recommend sticking to 4–8B on such machines (a rough memory estimate follows this list).
- Most macOS tooling runs on the GPU via Metal; the Apple Neural Engine is seen as underused or too weak for large LLMs, and low-level access is limited.
- There’s debate over whether frameworks like MLX actually target the ANE; consensus in the thread is “mostly GPU, ANE not really for big LLMs”.
- Some describe Mac Studio setups with 128–512 GB of unified memory running 120B–600B models at usable token rates, though prompt ingestion (prefill) can be very slow.
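The RAM rule of thumb follows from simple arithmetic: weight memory is roughly parameter count times bits per weight divided by 8, plus KV-cache and runtime overhead. A back-of-the-envelope sketch; the 2 GiB cache/overhead allowance is an assumption, not a measurement:

```ts
// Rough memory estimate for a quantized model:
// weight bytes = params * bitsPerWeight / 8, plus a placeholder KV-cache allowance.
function estimateGiB(paramsBillion: number, bitsPerWeight: number, kvCacheGiB = 2): number {
  const weightGiB = (paramsBillion * 1e9 * bitsPerWeight) / 8 / 2 ** 30;
  return weightGiB + kvCacheGiB;
}

// A 14B model at 4-bit quantization: ~8.5 GiB including the cache allowance,
// which is why 12-20B is treated as the ceiling on a 16 GB Mac once macOS
// and other apps claim their share of unified memory.
console.log(estimateGiB(14, 4).toFixed(1)); // ~8.5
console.log(estimateGiB(30, 4).toFixed(1)); // ~16.0, so 32 GB+ in practice
```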
Hallucinations, reliability, and behavior
- A vivid example: a local Hermes/Mistral model fabricates an interview with Sun Tzu despite explicit instructions not to add content, undermining trust for “editing-only” tasks.
- Commenters note LLMs are statistical, not logical; fine-tuning has intentionally biased them toward answering rather than deferring, making hallucinations hard to eliminate.
- There’s concern about anthropomorphizing models and treating “emergent” behavior as more than sophisticated pattern completion.
Practical use cases for local models
- Suggested “actually useful” applications:
  - Coding assistance and prototyping (Qwen, GLM, GPT-OSS models), including editor integration via tools like continue.dev.
  - Summarization and organization of personal data: diaries, Obsidian notes, email, calendars, screenshots, semantic desktop search.
  - On-device automation: classification, grammar checking, embeddings-based search, offline Q&A in poor-connectivity scenarios (a search sketch follows this list).
  - Privacy-sensitive workflows (financial data, personal journals) where cloud use feels unacceptable.
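As a sketch of the embeddings-based search idea from the list above, assuming a local OpenAI-compatible /v1/embeddings endpoint is running (LM Studio and llama.cpp's server expose one); the port and the nomic-embed-text model name are placeholders:

```ts
// Tiny semantic search over local notes using a local embeddings endpoint.
async function embed(texts: string[]): Promise<number[][]> {
  const res = await fetch('http://localhost:1234/v1/embeddings', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ model: 'nomic-embed-text', input: texts }),
  });
  const data = await res.json();
  return data.data.map((d: { embedding: number[] }) => d.embedding);
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] ** 2; nb += b[i] ** 2; }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

async function search(query: string, notes: string[], topK = 3) {
  // Embed the query and all notes in one call, then rank by cosine similarity.
  const [q, ...docs] = await embed([query, ...notes]);
  return notes
    .map((text, i) => ({ text, score: cosine(q, docs[i]) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```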
Model choice, sizes, and recommended setups
- Frequently mentioned models:
  - General/coding: Qwen3-30B-A3B (and its Coder variant), GLM-4.5 and GLM-4.5-Air, GPT-OSS-20B/120B, Gemma 3 (12B and 270M), Mistral Small (“Minstral” in the thread).
  - Very small tasks: Gemma 3 270M for email summarization; tiny models for embeddings and classification.
- Users report that on 16–32 GB Macs, aggressively quantized ~14–20B models are borderline; 48–64 GB or more is advised for 24–30B and above.
- Some warn that Ollama currently “hobbles” tool use for certain families (Qwen/DeepSeek) due to missing tool-prompt sections in its templates; alternatives like LM Studio or raw llama.cpp are suggested (see the tool-calling sketch after this list).
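The tool-use caveat above is about how a runtime renders the tools array into the model's chat template; the request itself follows the standard OpenAI schema whichever backend serves it. A minimal sketch with endpoint and model id as placeholders:

```ts
// Tool-calling request against a local OpenAI-compatible server. Whether the
// model actually emits a tool call depends on the runtime's chat template,
// which is exactly what the Ollama complaint above concerns.
async function callWithTools() {
  const res = await fetch('http://localhost:1234/v1/chat/completions', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'qwen3-30b-a3b', // placeholder: whatever model the server has loaded
      messages: [{ role: 'user', content: 'What is 23 * 19?' }],
      tools: [{
        type: 'function',
        function: {
          name: 'calculator',
          description: 'Evaluate a basic arithmetic expression',
          parameters: {
            type: 'object',
            properties: { expression: { type: 'string' } },
            required: ['expression'],
          },
        },
      }],
    }),
  });
  const data = await res.json();
  // A correctly templated model returns message.tool_calls instead of prose.
  console.log(data.choices[0].message.tool_calls ?? data.choices[0].message.content);
}

callWithTools();
```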
Cloud vs local and home inference boxes
- One camp expects local LLMs plus specialized small models to replace cloud use for many tasks; another argues the hardware gap to frontier models will keep cloud dominant for years.
- Proposals include a dedicated “home LLM server” (high-RAM Mac Studio or similar) accessed from thin clients or phones, possibly at $5k–$20k price points; others call this economically or practically “ridiculous” for most users.
- Some see “secure/private cloud compute” as the likely direction instead, with local strictly for niche or privacy-focused use.
Debate over Apple’s AI strategy
- Critics argue Apple is “late” and overly conservative: not exposing the ANE, not selling datacenter-grade silicon, not aggressively optimizing for LLMs.
- Defenders point to Apple’s massive shareholder returns, consumer focus, and deliberate, slow-roll approach (“late but polished”), suggesting avoiding the AI hardware arms race may be rational.
- There’s broad agreement that Apple Silicon’s unified memory is a strong advantage for local inference, but disagreement over whether Apple should extend this into enterprise/datacenter markets.