Experimenting with Local LLMs on macOS

In-browser local LLMs and sandboxing

  • Multiple projects already run LLMs fully in the browser via WebGPU/WASM (MLC web-llm, transformers.js demos, WebGPU Spaces, wllama, WebNN samples).
  • A key UX desire is a pure HTML page with a “Select model from disk” button that loads local files without uploading anything; someone demonstrates this pattern using transformers.js plus a local ONNX model folder (a minimal sketch follows this list).
  • There’s frustration that WebGPU isn’t enabled by default on Linux; some want WebGL-based solutions or non-GPU WASM fallbacks.
  • Others argue browser sandboxing is overrated compared to unprivileged containers/VMs, which can also isolate GPU workloads.
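
As a rough illustration of the transformers.js pattern above, the sketch below serves the ONNX model folder next to the page (e.g. under /models/) instead of picking it via a file dialog; the model folder name, the path, and the WebGPU option are assumptions, and the true file-picker variant needs extra glue (a custom cache/fetch) that is not shown.

```typescript
// Minimal sketch: run a local ONNX model fully in the browser with transformers.js.
// Assumes the model folder (config.json, tokenizer files, onnx/ weights) is served
// at /models/my-local-llm/ alongside the page; nothing is uploaded anywhere.
import { pipeline, env } from "@huggingface/transformers";

env.allowRemoteModels = false;   // never fall back to the Hugging Face Hub
env.localModelPath = "/models/"; // resolve model IDs against this local path

async function main() {
  // "my-local-llm" is a placeholder for whatever folder sits under /models/.
  const generator = await pipeline("text-generation", "my-local-llm", {
    device: "webgpu", // optional; drop this to use the WASM backend instead
  });
  const out = await generator("Local LLMs on macOS are", { max_new_tokens: 64 });
  console.log(out);
}

main();
```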

macOS local LLM tooling and interfaces

  • Popular tools: LM Studio (with an OpenAI-compatible local server; see the request sketch after this list), Ollama, On-Device AI, Pico AI Server + Witsy, Osaurus, llamafile, DEVONthink AI features, Open WebUI, and Electron-based UIs.
  • Some emphasize “no-install” browser-only experiences; others accept native apps or Docker if they give a simple chat UI plus model dropdown.
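
Since most of these tools expose the same OpenAI-style HTTP endpoint, a “simple chat UI” client can be a few lines of fetch. The sketch below assumes LM Studio’s default local address (localhost:1234) and a placeholder model name; Ollama and llama.cpp’s server accept the same request shape on their own ports.

```typescript
// Minimal sketch: chat with a local OpenAI-compatible server (LM Studio default shown).
// The model name is a placeholder; the list of loaded models is at GET /v1/models.
const BASE_URL = "http://localhost:1234/v1"; // LM Studio default; Ollama serves /v1 on 11434

async function chat(prompt: string): Promise<string> {
  const res = await fetch(`${BASE_URL}/chat/completions`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "qwen3-30b-a3b", // placeholder; use whatever model is loaded locally
      messages: [{ role: "user", content: prompt }],
      temperature: 0.7,
    }),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}

chat("Suggest three offline uses for a local model.").then(console.log);
```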

Hardware limits, Apple Silicon, and NPUs

  • Rule of thumb: 12–20B parameters (quantized) is near the practical upper bound on 16GB RAM; some recommend sticking to 4–8B on such machines (see the back-of-the-envelope estimate after this list).
  • Most macOS tooling runs on the GPU via Metal; the Apple Neural Engine is seen as underused or too weak for large LLMs, and low-level access is limited.
  • There’s debate over whether frameworks like MLX actually target the ANE; consensus in the thread is “mostly GPU, ANE not really for big LLMs”.
  • Some describe Mac Studio 128–512GB setups running 120B–600B models at usable token rates, but prompt ingestion (prefill) can be very slow.
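
The 16GB rule of thumb falls out of simple arithmetic: quantized weights cost roughly bits-per-weight / 8 bytes per parameter, plus KV-cache and runtime overhead. A back-of-the-envelope estimator (the overhead constant is an assumption, not a measurement):

```typescript
// Rough RAM estimate for a quantized model; constants are assumptions, not benchmarks.
function estimateGB(paramsBillions: number, bitsPerWeight: number): number {
  const weightsGB = (paramsBillions * 1e9 * bitsPerWeight) / 8 / 1e9;
  const overheadGB = 1.5; // assumed KV cache at modest context plus runtime buffers
  return weightsGB + overheadGB;
}

// 14B at ~4.5 bits/weight (Q4_K_M-style) is about 9.4GB: tight but workable on a
// 16GB Mac once the OS takes its share. 20B at the same quant is about 12.8GB,
// which leaves very little headroom.
console.log(estimateGB(14, 4.5).toFixed(1), "GB");
console.log(estimateGB(20, 4.5).toFixed(1), "GB");
```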

Hallucinations, reliability, and behavior

  • A vivid example: a local Hermes/Mistral model fabricates an interview with Sun Tzu despite explicit instructions not to add content, undermining trust for “editing-only” tasks.
  • Commenters note LLMs are statistical, not logical; fine-tuning has intentionally biased them toward answering rather than deferring, making hallucinations hard to eliminate.
  • There’s concern about anthropomorphizing models and treating “emergent” behavior as more than sophisticated pattern completion.

Practical use cases for local models

  • Suggested “actually useful” applications:
    • Coding assistance and prototyping (Qwen, GLM, GPT-OSS models), including editor integration via tools like continue.dev.
    • Summarization and organization of personal data: diaries, Obsidian notes, email, calendars, screenshots, semantic desktop search.
    • On-device automation: classification, grammar checking, embeddings-based search (see the sketch after this list), offline Q&A in poor-connectivity scenarios.
    • Privacy-sensitive workflows (financial data, personal journals) where cloud use feels unacceptable.
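
As a concrete version of the embeddings-based search idea, here is a small sketch against Ollama’s /api/embeddings endpoint; the embedding model name (nomic-embed-text) and the notes are illustrative assumptions, and any locally served embedding model would do.

```typescript
// Minimal sketch: semantic search over local notes using Ollama embeddings.
// Model name and note contents are placeholders.
const OLLAMA = "http://localhost:11434";

async function embed(text: string): Promise<number[]> {
  const res = await fetch(`${OLLAMA}/api/embeddings`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "nomic-embed-text", prompt: text }),
  });
  const { embedding } = await res.json();
  return embedding;
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

async function search(query: string, notes: string[]) {
  const qVec = await embed(query);
  const scored = await Promise.all(
    notes.map(async (note) => ({ note, score: cosine(qVec, await embed(note)) }))
  );
  return scored.sort((a, b) => b.score - a.score);
}

search("last year's tax paperwork", [
  "2023 tax return PDF saved under ~/Documents/finance",
  "Obsidian note: trip planning for Kyoto",
  "Email thread about the accountant appointment",
]).then((hits) => console.log(hits[0]));
```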

Model choice, sizes, and recommended setups

  • Frequently mentioned models:
    • General/coding: Qwen3-30B-A3B (and its Coder variant), GLM-4.5(-Air), GPT-OSS-20B/120B, Gemma 3 (12B and 270M), Mistral Small (“Minstral” in some comments).
    • Very small tasks: Gemma 3 270M for email summarization; tiny models for embeddings and classification.
  • Users report that on 16–32GB Macs, aggressively quantized ~14–20B models are borderline; ≥48–64GB is advised for 24–30B and above.
  • Some warn Ollama currently “hobbles” tool use for certain families (Qwen/DeepSeek) due to missing tool-prompt sections in its templates; alternatives like LM Studio or raw llama.cpp are suggested (a request sketch follows below).
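
For anyone hitting the tool-use issue, the usual workaround is to send OpenAI-style tool definitions to a server whose template supports them (LM Studio or llama.cpp’s llama-server). A sketch of the request shape, with a hypothetical get_weather tool and placeholder model and port:

```typescript
// Sketch of OpenAI-style tool calling against a local server (llama-server default port).
// Tool name, schema, model, and port are placeholders; whether the model actually emits
// tool_calls depends on the model family and the server's chat template.
async function askWithTools() {
  const res = await fetch("http://localhost:8080/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "qwen3-30b-a3b",
      messages: [{ role: "user", content: "What's the weather in Oslo right now?" }],
      tools: [
        {
          type: "function",
          function: {
            name: "get_weather", // hypothetical tool, for illustration only
            description: "Look up the current weather for a city",
            parameters: {
              type: "object",
              properties: { city: { type: "string" } },
              required: ["city"],
            },
          },
        },
      ],
    }),
  });
  const data = await res.json();
  // If the model chooses to call the tool, the call appears here instead of plain content.
  console.log(data.choices[0].message.tool_calls ?? data.choices[0].message.content);
}

askWithTools();
```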

Cloud vs local and home inference boxes

  • One camp expects local LLMs plus specialized small models to replace cloud use for many tasks; another argues the hardware gap to frontier models will keep cloud dominant for years.
  • Proposals include a dedicated “home LLM server” (high-RAM Mac Studio or similar) accessed from thin clients or phones, possibly at $5k–$20k price points; others call this economically or practically “ridiculous” for most users.
  • Some see “secure/private cloud compute” as the likely direction instead, with local strictly for niche or privacy-focused use.

Debate over Apple’s AI strategy

  • Critics argue Apple is “late” and overly conservative: not exposing ANE, not selling datacenter-grade silicon, not aggressively optimizing for LLMs.
  • Defenders point to Apple’s massive shareholder returns, consumer focus, and deliberate, slow-roll approach (“late but polished”), suggesting avoiding the AI hardware arms race may be rational.
  • There’s broad agreement that Apple Silicon’s unified memory is a strong advantage for local inference, but disagreement over whether Apple should extend this into enterprise/datacenter markets.