Running Qwen3 on your MacBook, using MLX, to vibe code for free
Local Qwen3 “vibe coding” setup
- Thread centers on running Qwen3-30B locally on Apple Silicon via MLX and using Localforge as an autonomous coding agent (a minimal generation sketch follows this list).
- The agent can run inside a project folder, execute shell commands, inspect files, and iterate, similar to Claude Code.
- Tool-calling quality varies with model size: 8B is described as poor, and 30B as “OK but random”, needing robust wrappers around its tool calls.
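For context, a minimal sketch of what “running Qwen3-30B via MLX” looks like underneath the agent layer, assuming the mlx-lm package and the mlx-community/Qwen3-30B-A3B-8bit weights mentioned later in the thread:

```python
# Minimal local generation with MLX (pip install mlx-lm), on Apple Silicon.
# The repo id must be exact; see the gotcha in the MLX section below.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3-30B-A3B-8bit")

# Qwen3 is a chat model, so wrap the request in its chat template.
messages = [{"role": "user", "content": "Write a Python function that parses ISO 8601 dates."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

text = generate(model, tokenizer, prompt=prompt, max_tokens=512, verbose=True)
print(text)
```

An agent like Localforge adds the tool-calling loop (shell commands, file edits) on top of a generation call like this.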
Model quality and coding ability
- Some find Qwen3-30B-A3B “very impressive” and close to frontier models for general tasks, especially with the recommended sampling parameters (see the sketch after this list).
- Others report serious issues for coding: loops, forgetting the task, getting stuck on repeated tool calls, or failing on modest prompts.
- Several commenters say Qwen3 is not yet reliable for serious coding and recommend Qwen2.5-32B, Cogito-32B, GLM-32B, or cloud models such as Claude Sonnet and Gemini instead.
- Sub-1B and 4B variants: 0.6B is seen as useful for simple extraction or as a draft model for speculative decoding; 4B fares surprisingly well for lightweight tasks.
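The “proper sampling params” point maps to Qwen’s published recommendations for Qwen3 (temperature 0.6, top-p 0.95, top-k 20 in thinking mode; 0.7/0.8/20 without thinking). A hedged sketch of applying them through an OpenAI-compatible local server; the base URL and model name are assumptions that depend on how the model is served (LM Studio, Ollama, and llama.cpp’s server all expose this API):

```python
# Apply Qwen3's recommended sampling settings through an OpenAI-compatible
# local endpoint. The base_url and model name below are assumptions: adjust
# them to whatever your local server (LM Studio, Ollama, llama-server) reports.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="local")

resp = client.chat.completions.create(
    model="qwen3-30b-a3b",                  # name registered by your server
    messages=[{"role": "user", "content": "Explain this stack trace: ..."}],
    temperature=0.6,                        # thinking-mode recommendation
    top_p=0.95,
    extra_body={"top_k": 20},               # non-standard field, passed through if supported
    max_tokens=1024,
)
print(resp.choices[0].message.content)
```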
Performance, RAM, and hardware
- Rule of thumb cited: roughly 1 GB of RAM per billion parameters at 8-bit; 4–6-bit quantization lowers that dramatically (worked estimate after this list).
- 24 GB Macs struggle with 27B; 32–64 GB can run 27–30B but may crowd out other apps.
- Reported speeds: 30B around 40–70 tok/s on high-end M1/M3 Max machines with Q4 quantization; ~15 tok/s on an RTX 3060 or M4 Air; ~20 GB of VRAM is typical for 30B at Q4.
- 16 GB Macs are advised to stick to ~12B quantized models.
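A back-of-the-envelope version of the rule of thumb above, weights only (KV cache and runtime overhead add several more GB, which is why ~20 GB is reported for 30B at Q4):

```python
# Rough weight memory: params (in billions) * bits per weight / 8 = GB.
# Ignores KV cache, activations, and runtime overhead.
def weight_gb(params_b: float, bits: float) -> float:
    return params_b * bits / 8  # 1B params at 8-bit ~= 1 GB

for bits in (8, 6, 4):
    print(f"Qwen3-30B @ {bits}-bit: ~{weight_gb(30, bits):.0f} GB of weights")
# -> ~30 GB at 8-bit, ~22 GB at 6-bit, ~15 GB at 4-bit
```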
MLX, Ollama, and MPS
- MLX is praised as Apple-Silicon-optimized and reported to be faster and more efficient than running GGUF models on the GPU; commenters describe it as building on Apple’s Metal (MPS) stack under the hood.
- Ollama supports Qwen3 but is reported to be slower for the 30B; users suggest llama.cpp (with recent commits) or LM Studio with its MLX backend instead.
- One gotcha: MLX setup requires the exact model name (e.g., mlx-community/Qwen3-30B-A3B-8bit), or downloads will 404; a quick check is sketched below.
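A hedged way to fail fast on the wrong-name 404, assuming the huggingface_hub package (which mlx-lm uses for downloads) and its repo_exists helper:

```python
# mlx-lm resolves model names as Hugging Face repo ids, so a near-miss such
# as "mlx-community/Qwen3-30B-8bit" only 404s at download time. Checking the
# id first gives a clearer error.
from huggingface_hub import repo_exists
from mlx_lm import load

repo_id = "mlx-community/Qwen3-30B-A3B-8bit"
if not repo_exists(repo_id):
    raise SystemExit(f"No such repo on the Hub: {repo_id!r}; check the exact name")

model, tokenizer = load(repo_id)
```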
Local vs cloud tradeoffs & use cases
- Many enjoy that local models are now “usable” on personal machines and improving over time, though still behind frontier models for coding and factual accuracy.
- Reasons to run local: data sovereignty, offline use, experimentation, and avoiding detection of AI usage.
- Others argue that for professional coding, paying for top-tier cloud models is still worth it.
Orchestration, agents, and MCP
- Interest in a central proxy that normalizes access to multiple LLMs and logs every call; LiteLLM, OpenRouter, Opik, and Simon Willison’s LLM tool are suggested (see the LiteLLM sketch after this list).
- MCP + Ollama bridges are mentioned for combining local models with tool servers and IDEs.
- Localforge’s multi-agent story: users must explicitly choose agents; routing is not automatic but implemented via function calls and an “expert model” defined in the system prompt.
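For the proxy/normalization idea, LiteLLM’s Python client already gives one call shape across local and cloud backends; a minimal sketch, where the Ollama tag and the Anthropic model name are assumptions (LiteLLM also ships a standalone proxy server, which is the usual choice for central logging):

```python
# Same call shape for a local Ollama model and a cloud model. The model tags
# are assumptions: use whatever you have pulled locally / hold API keys for
# (the cloud call reads ANTHROPIC_API_KEY from the environment).
from litellm import completion

messages = [{"role": "user", "content": "Summarize this diff: ..."}]

local = completion(model="ollama/qwen3:30b-a3b", messages=messages)
cloud = completion(model="anthropic/claude-3-7-sonnet-latest", messages=messages)

print(local.choices[0].message.content)
print(cloud.choices[0].message.content)
```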
“Vibe coding” discussion & meta
- “Vibe coding” is discussed as AI-driven development where users accept code they don’t fully understand, largely via prompt iteration.
- Some are amused or concerned about its implications for careers; others say tools are still far from replacing developers, especially for refactoring and adhering to existing architectures.
- A side thread debates disclosure: some see the post as stealth promotion for Localforge and argue the relationship should be clearly stated, even for open-source projects.