Ollama is now powered by MLX on Apple Silicon in preview
Local vs Cloud LLMs
- Many argue on-device LLMs are “the future” for privacy, offline use, lower marginal costs, and reduced vendor lock-in; others see them as complementary to more capable cloud models, not replacements.
- Strong disagreement on whether “most users” need frontier models. Some say smaller models suffice for grammar, summarization, simple Q&A; others insist frontier models’ better reliability and knowledge are critical for real decisions and work.
- Several expect a hybrid pattern: local models handle everyday tasks and orchestration, escalating to cloud models only when needed.
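The hybrid pattern described above can be sketched as a tiny router. This is a hypothetical illustration only: the keyword heuristic, the threshold, and the backend labels are assumptions, not part of any stack discussed in the thread.

```python
# Hypothetical sketch of hybrid local/cloud routing: everyday prompts stay
# local, and requests that look high-stakes or oversized escalate to the cloud.

ESCALATION_KEYWORDS = {"prove", "architect", "legal", "diagnose"}

def choose_backend(prompt: str, max_local_words: int = 2000) -> str:
    """Route easy requests to a local model, hard ones to a cloud model."""
    words = prompt.lower().split()
    # Escalate on long contexts or keywords suggesting real decisions/work.
    if len(words) > max_local_words or ESCALATION_KEYWORDS & set(words):
        return "cloud"
    return "local"

print(choose_backend("summarize this paragraph"))
print(choose_backend("diagnose this intermittent kernel fault"))
```

In practice the router would sit in front of both endpoints and fall back to local when offline, which is part of the appeal of the hybrid setup.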
Performance, Hardware, and Energy
- Apple Silicon (M-series, especially high-RAM M4/M5) is widely seen as the current sweet spot for local inference due to unified memory and bandwidth; MLX exploits this better than Metal-based stacks.
- Mixed reports on comfort: models do run well, but generate substantial heat and fan noise under heavy load.
- Debate on energy: some argue datacenters with batching are 10–100x more efficient per token; others claim repurposing existing consumer hardware and avoiding massive AI datacenters could cut total energy use. This remains contested.
- Memory is the main constraint (32–128 GB often cited); some lament the lack of SSD offload in MLX compared to emerging GGUF approaches.
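The memory constraint can be made concrete with back-of-envelope arithmetic. A sketch, not an MLX formula: the 1.2× overhead factor for KV cache and activations is an assumed typical value.

```python
def model_memory_gb(params_billion: float, bits_per_weight: int,
                    overhead: float = 1.2) -> float:
    """Rough RAM estimate: weight bytes plus assumed KV-cache/activation overhead."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

# Why 32-128 GB keeps coming up: 4-bit quants of common sizes.
for params, bits in [(7, 4), (32, 4), (70, 4), (70, 16)]:
    print(f"{params}B @ {bits}-bit ≈ {model_memory_gb(params, bits):.0f} GB")
```

A 4-bit 70B model already lands around 42 GB, and the same model at 16-bit would need well over 128 GB, which is why unified-memory Macs with large RAM configurations are the sweet spot.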
Model Quality and Use Cases
- Open-weight models like Qwen 3.5 (4B–70B, MoE variants) are frequently praised as “good enough” for many coding and agent workflows, but still fall short of top-tier proprietary models (Claude, GPT, Gemini) in reasoning, reliability, and tool use.
- Local models are used for: coding assistants, document RAG, journaling analysis, real-time voice practice on phones, shell-command helpers, and domain-specific agents.
- Tool-calling plus local knowledge bases (e.g., Wikipedia mirrors) are seen as key to compensating for smaller model knowledge.
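The tool-calling-plus-local-knowledge idea can be sketched as a minimal dispatch loop. Everything here is hypothetical: the `wiki_lookup` tool, the in-memory stand-in for a Wikipedia mirror, and the tool-call dict shape are assumptions for illustration, not any specific framework's API.

```python
# Sketch: a small local model that lacks a fact emits a tool call instead,
# and the runtime resolves it against a local knowledge base.

WIKI_MIRROR = {  # stand-in for a local Wikipedia mirror
    "mlx": "MLX is Apple's array framework for Apple Silicon.",
}

def wiki_lookup(topic: str) -> str:
    """Query the (mock) local mirror; a real tool would hit a local index."""
    return WIKI_MIRROR.get(topic.lower(), "no local article found")

TOOLS = {"wiki_lookup": wiki_lookup}

def dispatch(tool_call: dict) -> str:
    """Execute a model-emitted call of the form {'name': ..., 'arguments': {...}}."""
    fn = TOOLS[tool_call["name"]]
    return fn(**tool_call["arguments"])

print(dispatch({"name": "wiki_lookup", "arguments": {"topic": "MLX"}}))
```

The point of the pattern is that a 4B-class model only needs to learn *when* to look something up, not to memorize the content itself.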
Ecosystem: Ollama, MLX, and Alternatives
- Ollama wins points for simple CLI/API and Docker-like UX; criticism centers on slower adoption of MLX and some rough edges.
- MLX backends are reported to be modestly to significantly faster than llama.cpp/Metal on Macs, at the cost of higher RAM use.
- Competing stacks (LM Studio, Lemonade, llama.cpp, omlx) emphasize earlier MLX support, SSD KV caching, or better optimization.
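Speed comparisons like the ones above usually come down to decode tokens/sec. Ollama's generate API reports an `eval_count` and an `eval_duration` in nanoseconds; a small helper turns those into a comparable number. The two example runs below are invented figures for illustration, not measured results.

```python
def tokens_per_second(eval_count: int, eval_duration: int) -> float:
    """Convert generate stats (eval_duration in nanoseconds) to tokens/sec."""
    return eval_count / (eval_duration / 1e9)

# Hypothetical stats from two runs of the same prompt on different backends.
mlx_run = {"eval_count": 512, "eval_duration": 4_000_000_000}    # 4.00 s
metal_run = {"eval_count": 512, "eval_duration": 5_120_000_000}  # 5.12 s

print(f"MLX:   {tokens_per_second(**mlx_run):.1f} tok/s")
print(f"Metal: {tokens_per_second(**metal_run):.1f} tok/s")
```

Comparing backends this way, on identical prompts and quantizations, is more meaningful than wall-clock impressions, since prompt-processing and decode speed differ independently.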
Open Models and Sustainability
- Concern that open-weight SOTA depends on large corporate or state funding; potential long-term business models (fine-tuning services, lump-sum licensing, B2B) are discussed but none is seen as proven.