Running Qwen3 on your MacBook, using MLX, to vibe code for free

Local Qwen3 “vibe coding” setup

  • Thread centers on running Qwen3-30B locally on Apple Silicon via MLX and using Localforge as an autonomous coding agent.
  • The agent can run inside a project folder, execute shell commands, inspect files, and iterate in a Claude Code–like loop.
  • Tool-calling quality varies with model size: the 8B is described as poor, and the 30B as “OK but random,” needing robust wrappers (see the sketch after this list).
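
For context, here is a minimal Python sketch of the kind of defensive tool-call wrapper commenters say the 30B needs. The tool whitelist, JSON shape, and retry policy are illustrative assumptions, not Localforge’s actual implementation.

```python
import json

# Hypothetical whitelist of tools and their required arguments; a real agent
# such as Localforge defines its own schema.
ALLOWED_TOOLS = {
    "run_shell": {"command"},
    "read_file": {"path"},
    "write_file": {"path", "content"},
}


def parse_tool_call(raw: str) -> dict | None:
    """Validate a model's tool-call output; return None if it is unusable.

    Small local models often emit almost-valid JSON (stray prose, missing
    fields), so the agent should reject and re-prompt instead of crashing.
    """
    try:
        # Tolerate prose around the JSON by slicing to the outermost braces.
        call = json.loads(raw[raw.index("{"): raw.rindex("}") + 1])
    except ValueError:  # covers both a missing brace and malformed JSON
        return None

    tool = call.get("tool")
    args = call.get("arguments", {})
    if tool not in ALLOWED_TOOLS or not ALLOWED_TOOLS[tool] <= set(args):
        return None
    return {"tool": tool, "arguments": args}


def run_with_retries(ask_model, prompt: str, max_attempts: int = 3) -> dict:
    """Re-prompt until the model emits a valid tool call or attempts run out."""
    for _ in range(max_attempts):
        call = parse_tool_call(ask_model(prompt))
        if call is not None:
            return call
        prompt += "\nYour last reply was not a valid tool call. Respond with JSON only."
    raise RuntimeError("model never produced a valid tool call")
```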

Model quality and coding ability

  • Some find Qwen3-30B-A3B “very impressive” and close to frontier models for general tasks, especially with the proper sampling params (see the request sketch after this list).
  • Others report serious issues for coding: loops, forgetting the task, getting stuck on repeated tool calls, or failing on modest prompts.
  • Several commenters say Qwen3 is not yet reliable for serious coding; recommend Qwen2.5-32B, Cogito-32B, GLM-32B, or cloud models (Claude, Gemini, Sonnet).
  • Sub-1B and 4B variants: the 0.6B is seen as useful for simple extraction or as a draft model for speculative decoding; the 4B fares surprisingly well for lightweight tasks.
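
As an illustration of “proper sampling params,” here is a hedged sketch of a chat request against a local OpenAI-compatible endpoint (llama.cpp’s llama-server, LM Studio, and Ollama all expose one). The port, model name, and exact parameter values are assumptions, roughly following Qwen3’s published recommendations for thinking mode.

```python
import requests

# Assumes a local OpenAI-compatible server; adjust the port and the model
# name to whatever your runtime has registered.
API_URL = "http://localhost:8080/v1/chat/completions"

payload = {
    "model": "qwen3-30b-a3b",
    "messages": [
        {"role": "user", "content": "Write a Python function that reverses a linked list."}
    ],
    # Sampling along the lines of Qwen3's recommendations for thinking mode;
    # greedy decoding reportedly makes looping and repetition worse.
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,        # non-standard OpenAI field, accepted by llama.cpp and LM Studio
    "max_tokens": 2048,
}

resp = requests.post(API_URL, json=payload, timeout=600)
print(resp.json()["choices"][0]["message"]["content"])
```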

Performance, RAM, and hardware

  • Rule of thumb cited: ~1 GB of RAM per billion params at 8-bit; 4–6 bit quantization lowers that dramatically (see the arithmetic sketch after this list).
  • 24 GB Macs struggle with 27B; 32–64 GB can run 27–30B but may crowd out other apps.
  • Reported speeds: 30B at roughly 40–70 tok/s on high-end M1/M3 Max machines with a Q4 quant; ~15 tok/s on an RTX 3060 or M4 Air; ~20 GB of VRAM is typical for 30B at Q4.
  • 16 GB Macs are advised to stick to ~12B quantized models.
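
A quick back-of-the-envelope script applying that rule of thumb; the 20% overhead factor for KV cache and runtime buffers is a guess, not a measured figure.

```python
def approx_weight_gb(params_billion: float, bits_per_weight: float,
                     overhead: float = 1.2) -> float:
    """Rough GB needed for the weights: ~1 GB per billion params at 8-bit,
    scaled by bit width, plus ~20% headroom for KV cache and buffers."""
    return params_billion * (bits_per_weight / 8) * overhead


for name, params, bits in [
    ("Qwen3-30B-A3B @ 8-bit", 30, 8),
    ("Qwen3-30B-A3B @ 4-bit", 30, 4),
    ("Qwen3-8B      @ 4-bit", 8, 4),
]:
    print(f"{name}: ~{approx_weight_gb(params, bits):.0f} GB")
# -> ~36 GB, ~18 GB, and ~5 GB respectively, which lines up with the
#    reported ~20 GB footprint for a 30B Q4 quant.
```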

MLX, Ollama, and MPS

  • MLX is praised as Apple-Silicon-optimized, faster and more efficient than running GGUF models on the GPU; it is built on Apple’s Metal stack under the hood.
  • Ollama supports Qwen3 but is reported to be slower for the 30B; users suggest llama.cpp (with recent commits) or LM Studio with the MLX backend.
  • One gotcha: the MLX setup requires the exact model name (e.g., mlx-community/Qwen3-30B-A3B-8bit) or the download will 404 (see the loading sketch below).
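
A minimal loading sketch assuming the mlx-lm Python package (`pip install mlx-lm`); only the exact repo name from the bullet above is taken from the thread, the prompt is illustrative.

```python
from mlx_lm import load, generate

# The repo name must match the mlx-community upload exactly,
# or the Hugging Face download 404s.
model, tokenizer = load("mlx-community/Qwen3-30B-A3B-8bit")

messages = [
    {"role": "user", "content": "Write a shell one-liner that counts lines of Python in this repo."}
]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)

print(generate(model, tokenizer, prompt=prompt, max_tokens=512))
```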

Local vs cloud tradeoffs & use cases

  • Many enjoy that local models are now “usable” on personal machines and improving over time, though still behind frontier models for coding and factual accuracy.
  • Reasons to run local: data sovereignty, offline use, experimentation, and avoiding detection of AI usage.
  • Others argue that for professional coding, paying for top-tier cloud models is still worth it.

Orchestration, agents, and MCP

  • Interest in a central proxy that normalizes access to multiple LLMs and logs all calls; LiteLLM, OpenRouter, Opik, and Simon Willison’s LLM tool are suggested (see the LiteLLM sketch after this list).
  • MCP + Ollama bridges are mentioned for combining local models with tool servers and IDEs.
  • Localforge’s multi-agent story: users must explicitly choose agents; routing is not automatic but is implemented via function calls and an “expert model” defined in the system prompt.
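
A sketch of the “one front door for many models” idea using the LiteLLM Python SDK rather than its proxy server (the proxy adds request logging on top of the same interface). The model identifiers and local endpoint are placeholders for whatever you actually have registered.

```python
import litellm

messages = [{"role": "user", "content": "Explain this stack trace."}]

# Same call shape whether the target is a local Ollama model...
local = litellm.completion(
    model="ollama/qwen3:30b-a3b",
    api_base="http://localhost:11434",
    messages=messages,
)

# ...or a hosted frontier model (API key read from the environment).
cloud = litellm.completion(
    model="anthropic/claude-3-5-sonnet-20240620",
    messages=messages,
)

for resp in (local, cloud):
    print(resp.choices[0].message.content[:200])
```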

“Vibe coding” discussion & meta

  • “Vibe coding” is discussed as AI-driven development where users accept code they don’t fully understand, largely via prompt iteration.
  • Some are amused or concerned about its implications for careers; others say the tools are still far from replacing developers, especially for refactoring and adhering to existing architectures.
  • A side thread debates disclosure: some see the post as stealth promotion for Localforge and argue the relationship should be clearly stated, even for open-source projects.