Ollama is now powered by MLX on Apple Silicon in preview
Local vs Cloud LLMs
- Many argue on-device LLMs are “the future” for privacy, offline use, lower marginal costs, and reduced vendor lock-in; others see them as complementary to more capable cloud models, not replacements.
- Strong disagreement on whether “most users” need frontier models. Some say smaller models suffice for grammar, summarization, simple Q&A; others insist frontier models’ better reliability and knowledge are critical for real decisions and work.
- Several expect a hybrid pattern: local models handle everyday tasks and orchestration, escalating to cloud models only when needed.
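The hybrid pattern described above can be sketched as a tiny router. This is a hypothetical illustration only: the keyword heuristic, the threshold, and the backend labels are assumptions, not part of any stack discussed in the thread.

```python
# Hypothetical sketch of hybrid local/cloud routing: everyday prompts stay
# local, and requests that look high-stakes or oversized escalate to the cloud.

ESCALATION_KEYWORDS = {"prove", "architect", "legal", "diagnose"}

def choose_backend(prompt: str, max_local_words: int = 2000) -> str:
    """Route easy requests to a local model, hard ones to a cloud model."""
    words = prompt.lower().split()
    # Escalate on long contexts or keywords suggesting real decisions/work.
    if len(words) > max_local_words or ESCALATION_KEYWORDS & set(words):
        return "cloud"
    return "local"

print(choose_backend("summarize this paragraph"))
print(choose_backend("diagnose this intermittent kernel fault"))
```

In practice the router would sit in front of both endpoints and fall back to local when offline, which is part of the appeal of the hybrid setup.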
Performance, Hardware, and Energy
- Apple Silicon (M-series, especially high-RAM M4/M5) is widely seen as the current sweet spot for local inference due to unified memory and bandwidth; MLX exploits this better than Metal-based stacks.
- Mixed reports on comfort: models do run well, but generate substantial heat and fan noise under heavy load.
- Debate on energy: some argue datacenters with batching are 10–100x more efficient per token; others claim repurposing existing consumer hardware and avoiding massive AI datacenters could cut total energy use. This remains contested.
- Memory is the main constraint (32–128 GB often cited); some lament the lack of SSD offload in MLX compared to emerging GGUF approaches.
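The memory constraint can be made concrete with back-of-envelope arithmetic. A sketch, not an MLX formula: the 1.2× overhead factor for KV cache and activations is an assumed typical value.

```python
def model_memory_gb(params_billion: float, bits_per_weight: int,
                    overhead: float = 1.2) -> float:
    """Rough RAM estimate: weight bytes plus assumed KV-cache/activation overhead."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

# Why 32-128 GB keeps coming up: 4-bit quants of common sizes.
for params, bits in [(7, 4), (32, 4), (70, 4), (70, 16)]:
    print(f"{params}B @ {bits}-bit ≈ {model_memory_gb(params, bits):.0f} GB")
```

A 4-bit 70B model already lands around 42 GB, and the same model at 16-bit would need well over 128 GB, which is why unified-memory Macs with large RAM configurations are the sweet spot.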
Model Quality and Use Cases
- Open-weight models like Qwen 3.5 (4B–70B, MoE variants) are frequently praised as “good enough” for many coding and agent workflows, but still fall short of top-tier proprietary models (Claude, GPT, Gemini) in reasoning, reliability, and tool use.
- Local models are used for: coding assistants, document RAG, journaling analysis, real-time voice practice on phones, shell-command helpers, and domain-specific agents.
- Tool-calling plus local knowledge bases (e.g., Wikipedia mirrors) are seen as key to compensating for smaller model knowledge.
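The tool-calling-plus-local-knowledge idea can be sketched as a minimal dispatch loop. Everything here is hypothetical: the `wiki_lookup` tool, the in-memory stand-in for a Wikipedia mirror, and the tool-call dict shape are assumptions for illustration, not any specific framework's API.

```python
# Sketch: a small local model that lacks a fact emits a tool call instead,
# and the runtime resolves it against a local knowledge base.

WIKI_MIRROR = {  # stand-in for a local Wikipedia mirror
    "mlx": "MLX is Apple's array framework for Apple Silicon.",
}

def wiki_lookup(topic: str) -> str:
    """Query the (mock) local mirror; a real tool would hit a local index."""
    return WIKI_MIRROR.get(topic.lower(), "no local article found")

TOOLS = {"wiki_lookup": wiki_lookup}

def dispatch(tool_call: dict) -> str:
    """Execute a model-emitted call of the form {'name': ..., 'arguments': {...}}."""
    fn = TOOLS[tool_call["name"]]
    return fn(**tool_call["arguments"])

print(dispatch({"name": "wiki_lookup", "arguments": {"topic": "MLX"}}))
```

The point of the pattern is that a 4B-class model only needs to learn *when* to look something up, not to memorize the content itself.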
Ecosystem: Ollama, MLX, and Alternatives
- Ollama wins points for simple CLI/API and Docker-like UX; criticism centers on slower adoption of MLX and some rough edges.
- MLX backends are reported to be modestly to significantly faster than llama.cpp/Metal on Macs, at the cost of higher RAM use.
- Competing stacks (LM Studio, Lemonade, llama.cpp, omlx) emphasize earlier MLX support, SSD KV caching, or better optimization.
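Speed comparisons like the ones above usually come down to decode tokens/sec. Ollama's generate API reports an `eval_count` and an `eval_duration` in nanoseconds; a small helper turns those into a comparable number. The two example runs below are invented figures for illustration, not measured results.

```python
def tokens_per_second(eval_count: int, eval_duration: int) -> float:
    """Convert generate stats (eval_duration in nanoseconds) to tokens/sec."""
    return eval_count / (eval_duration / 1e9)

# Hypothetical stats from two runs of the same prompt on different backends.
mlx_run = {"eval_count": 512, "eval_duration": 4_000_000_000}    # 4.00 s
metal_run = {"eval_count": 512, "eval_duration": 5_120_000_000}  # 5.12 s

print(f"MLX:   {tokens_per_second(**mlx_run):.1f} tok/s")
print(f"Metal: {tokens_per_second(**metal_run):.1f} tok/s")
```

Comparing backends this way, on identical prompts and quantizations, is more meaningful than wall-clock impressions, since prompt-processing and decode speed differ independently.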
Open Models and Sustainability
- Concern that open-weight SOTA depends on large corporate or state funding; potential long-term business models (fine-tuning services, lump-sum licensing, B2B) are discussed but none is seen as proven.