Experimenting with Local LLMs on macOS

In-browser local LLMs and sandboxing

  • Multiple projects already run LLMs fully in the browser via WebGPU/WASM (MLC web-llm, transformers.js demos, WebGPU Spaces, wllama, WebNN samples).
  • A key UX desire is a pure HTML page with a “Select model from disk” button that loads local files without uploading anything; someone demonstrates this pattern using transformers.js plus a local ONNX model folder (a minimal sketch follows this list).
  • There’s frustration that WebGPU isn’t enabled by default on Linux; some want WebGL-based solutions or non-GPU WASM fallbacks.
  • Others argue browser sandboxing is overrated compared to unprivileged containers/VMs, which can also isolate GPU workloads.
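
As a rough illustration of the transformers.js pattern above, the sketch below serves the ONNX model folder next to the page (e.g. under /models/) instead of picking it via a file dialog; the model folder name, the path, and the WebGPU option are assumptions, and the true file-picker variant needs extra glue (a custom cache/fetch) that is not shown.

```typescript
// Minimal sketch: run a local ONNX model fully in the browser with transformers.js.
// Assumes the model folder (config.json, tokenizer files, onnx/ weights) is served
// at /models/my-local-llm/ alongside the page; nothing is uploaded anywhere.
import { pipeline, env } from "@huggingface/transformers";

env.allowRemoteModels = false;   // never fall back to the Hugging Face Hub
env.localModelPath = "/models/"; // resolve model IDs against this local path

async function main() {
  // "my-local-llm" is a placeholder for whatever folder sits under /models/.
  const generator = await pipeline("text-generation", "my-local-llm", {
    device: "webgpu", // optional; drop this to use the WASM backend instead
  });
  const out = await generator("Local LLMs on macOS are", { max_new_tokens: 64 });
  console.log(out);
}

main();
```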

macOS local LLM tooling and interfaces

  • Popular tools: LM Studio (with an OpenAI-compatible local server; see the request sketch after this list), Ollama, On-Device AI, Pico AI Server + Witsy, Osaurus, llamafile, DEVONthink AI features, Open WebUI, and Electron-based UIs.
  • Some emphasize “no-install” browser-only experiences; others accept native apps or Docker if they give a simple chat UI plus model dropdown.
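
Since most of these tools expose the same OpenAI-style HTTP endpoint, a “simple chat UI” client can be a few lines of fetch. The sketch below assumes LM Studio’s default local address (localhost:1234) and a placeholder model name; Ollama and llama.cpp’s server accept the same request shape on their own ports.

```typescript
// Minimal sketch: chat with a local OpenAI-compatible server (LM Studio default shown).
// The model name is a placeholder; the list of loaded models is at GET /v1/models.
const BASE_URL = "http://localhost:1234/v1"; // LM Studio default; Ollama serves /v1 on 11434

async function chat(prompt: string): Promise<string> {
  const res = await fetch(`${BASE_URL}/chat/completions`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "qwen3-30b-a3b", // placeholder; use whatever model is loaded locally
      messages: [{ role: "user", content: prompt }],
      temperature: 0.7,
    }),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}

chat("Suggest three offline uses for a local model.").then(console.log);
```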

Hardware limits, Apple Silicon, and NPUs

  • Rule of thumb: 12–20B parameters (quantized) is near the practical upper bound on 16GB RAM; some recommend sticking to 4–8B on such machines (see the back-of-the-envelope estimate after this list).
  • Most macOS tooling runs on the GPU via Metal; the Apple Neural Engine is seen as underused or too weak for large LLMs, and low-level access is limited.
  • There’s debate over whether frameworks like MLX actually target the ANE; consensus in the thread is “mostly GPU, ANE not really for big LLMs”.
  • Some describe Mac Studio 128–512GB setups running 120B–600B models at usable token rates, but prompt ingestion (prefill) can be very slow.
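
The 16GB rule of thumb falls out of simple arithmetic: quantized weights cost roughly bits-per-weight / 8 bytes per parameter, plus KV-cache and runtime overhead. A back-of-the-envelope estimator (the overhead constant is an assumption, not a measurement):

```typescript
// Rough RAM estimate for a quantized model; constants are assumptions, not benchmarks.
function estimateGB(paramsBillions: number, bitsPerWeight: number): number {
  const weightsGB = (paramsBillions * 1e9 * bitsPerWeight) / 8 / 1e9;
  const overheadGB = 1.5; // assumed KV cache at modest context plus runtime buffers
  return weightsGB + overheadGB;
}

// 14B at ~4.5 bits/weight (Q4_K_M-style) is about 9.4GB: tight but workable on a
// 16GB Mac once the OS takes its share. 20B at the same quant is about 12.8GB,
// which leaves very little headroom.
console.log(estimateGB(14, 4.5).toFixed(1), "GB");
console.log(estimateGB(20, 4.5).toFixed(1), "GB");
```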

Hallucinations, reliability, and behavior

  • A vivid example: a local Hermes/Mistral model fabricates an interview with Sun Tzu despite explicit instructions not to add content, undermining trust for “editing-only” tasks.
  • Commenters note LLMs are statistical, not logical; fine-tuning has intentionally biased them toward answering rather than deferring, making hallucinations hard to eliminate.
  • There’s concern about anthropomorphizing models and treating “emergent” behavior as more than sophisticated pattern completion.

Practical use cases for local models

  • Suggested “actually useful” applications:
    • Coding assistance and prototyping (Qwen, GLM, GPT-OSS models), including editor integration via tools like continue.dev.
    • Summarization and organization of personal data: diaries, Obsidian notes, email, calendars, screenshots, semantic desktop search.
    • On-device automation: classification, grammar checking, embeddings-based search (see the sketch after this list), offline Q&A in poor-connectivity scenarios.
    • Privacy-sensitive workflows (financial data, personal journals) where cloud use feels unacceptable.
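
As a concrete version of the embeddings-based search idea, here is a small sketch against Ollama’s /api/embeddings endpoint; the embedding model name (nomic-embed-text) and the notes are illustrative assumptions, and any locally served embedding model would do.

```typescript
// Minimal sketch: semantic search over local notes using Ollama embeddings.
// Model name and note contents are placeholders.
const OLLAMA = "http://localhost:11434";

async function embed(text: string): Promise<number[]> {
  const res = await fetch(`${OLLAMA}/api/embeddings`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "nomic-embed-text", prompt: text }),
  });
  const { embedding } = await res.json();
  return embedding;
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

async function search(query: string, notes: string[]) {
  const qVec = await embed(query);
  const scored = await Promise.all(
    notes.map(async (note) => ({ note, score: cosine(qVec, await embed(note)) }))
  );
  return scored.sort((a, b) => b.score - a.score);
}

search("last year's tax paperwork", [
  "2023 tax return PDF saved under ~/Documents/finance",
  "Obsidian note: trip planning for Kyoto",
  "Email thread about the accountant appointment",
]).then((hits) => console.log(hits[0]));
```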

Model choice, sizes, and recommended setups

  • Frequently mentioned models:
    • General/coding: Qwen3-30B-A3B (and its Coder variant), GLM-4.5(-Air), GPT-OSS-20B/120B, Gemma 3 (12B and 270M), Mistral Small (“Minstral” in some comments).
    • Very small tasks: Gemma 3 270M for email summarization; tiny models for embeddings and classification.
  • Users report that on 16–32GB Macs, aggressively quantized ~14–20B models are borderline; ≥48–64GB is advised for 24–30B and above.
  • Some warn Ollama currently “hobbles” tool use for certain families (Qwen/DeepSeek) due to missing tool-prompt sections in its templates; alternatives like LM Studio or raw llama.cpp are suggested (a request sketch follows below).
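
For anyone hitting the tool-use issue, the usual workaround is to send OpenAI-style tool definitions to a server whose template supports them (LM Studio or llama.cpp’s llama-server). A sketch of the request shape, with a hypothetical get_weather tool and placeholder model and port:

```typescript
// Sketch of OpenAI-style tool calling against a local server (llama-server default port).
// Tool name, schema, model, and port are placeholders; whether the model actually emits
// tool_calls depends on the model family and the server's chat template.
async function askWithTools() {
  const res = await fetch("http://localhost:8080/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "qwen3-30b-a3b",
      messages: [{ role: "user", content: "What's the weather in Oslo right now?" }],
      tools: [
        {
          type: "function",
          function: {
            name: "get_weather", // hypothetical tool, for illustration only
            description: "Look up the current weather for a city",
            parameters: {
              type: "object",
              properties: { city: { type: "string" } },
              required: ["city"],
            },
          },
        },
      ],
    }),
  });
  const data = await res.json();
  // If the model chooses to call the tool, the call appears here instead of plain content.
  console.log(data.choices[0].message.tool_calls ?? data.choices[0].message.content);
}

askWithTools();
```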

Cloud vs local and home inference boxes

  • One camp expects local LLMs plus specialized small models to replace cloud use for many tasks; another argues the hardware gap to frontier models will keep cloud dominant for years.
  • Proposals include a dedicated “home LLM server” (high-RAM Mac Studio or similar) accessed from thin clients or phones, possibly at $5k–$20k price points; others call this economically or practically “ridiculous” for most users.
  • Some see “secure/private cloud compute” as the likely direction instead, with local strictly for niche or privacy-focused use.

Debate over Apple’s AI strategy

  • Critics argue Apple is “late” and overly conservative: not exposing ANE, not selling datacenter-grade silicon, not aggressively optimizing for LLMs.
  • Defenders point to Apple’s massive shareholder returns, consumer focus, and deliberate, slow-roll approach (“late but polished”), suggesting avoiding the AI hardware arms race may be rational.
  • There’s broad agreement that Apple Silicon’s unified memory is a strong advantage for local inference, but disagreement over whether Apple should extend this into enterprise/datacenter markets.