Running Qwen3 on your MacBook, using MLX, to vibe code for free

Local Qwen3 “vibe coding” setup

  • Thread centers on running Qwen3-30B locally on Apple Silicon via MLX and using Localforge as an autonomous coding agent.
  • The agent can run inside a project folder, execute shell commands, inspect files, and iterate in a Claude Code–like loop.
  • Tool-calling quality varies with model size: the 8B is described as poor, and the 30B as “OK but random,” needing robust wrappers (see the sketch after this list).
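
For context, here is a minimal Python sketch of the kind of defensive tool-call wrapper commenters say the 30B needs. The tool whitelist, JSON shape, and retry policy are illustrative assumptions, not Localforge’s actual implementation.

```python
import json

# Hypothetical whitelist of tools and their required arguments; a real agent
# such as Localforge defines its own schema.
ALLOWED_TOOLS = {
    "run_shell": {"command"},
    "read_file": {"path"},
    "write_file": {"path", "content"},
}


def parse_tool_call(raw: str) -> dict | None:
    """Validate a model's tool-call output; return None if it is unusable.

    Small local models often emit almost-valid JSON (stray prose, missing
    fields), so the agent should reject and re-prompt instead of crashing.
    """
    try:
        # Tolerate prose around the JSON by slicing to the outermost braces.
        call = json.loads(raw[raw.index("{"): raw.rindex("}") + 1])
    except ValueError:  # covers both a missing brace and malformed JSON
        return None

    tool = call.get("tool")
    args = call.get("arguments", {})
    if tool not in ALLOWED_TOOLS or not ALLOWED_TOOLS[tool] <= set(args):
        return None
    return {"tool": tool, "arguments": args}


def run_with_retries(ask_model, prompt: str, max_attempts: int = 3) -> dict:
    """Re-prompt until the model emits a valid tool call or attempts run out."""
    for _ in range(max_attempts):
        call = parse_tool_call(ask_model(prompt))
        if call is not None:
            return call
        prompt += "\nYour last reply was not a valid tool call. Respond with JSON only."
    raise RuntimeError("model never produced a valid tool call")
```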

Model quality and coding ability

  • Some find Qwen3-30B-A3B “very impressive” and close to frontier models for general tasks, especially with the proper sampling params (see the request sketch after this list).
  • Others report serious issues for coding: loops, forgetting the task, getting stuck on repeated tool calls, or failing on modest prompts.
  • Several commenters say Qwen3 is not yet reliable for serious coding; recommend Qwen2.5-32B, Cogito-32B, GLM-32B, or cloud models (Claude, Gemini, Sonnet).
  • Sub-1B and 4B variants: the 0.6B is seen as useful for simple extraction or as a draft model for speculative decoding; the 4B fares surprisingly well for lightweight tasks.
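
As an illustration of “proper sampling params,” here is a hedged sketch of a chat request against a local OpenAI-compatible endpoint (llama.cpp’s llama-server, LM Studio, and Ollama all expose one). The port, model name, and exact parameter values are assumptions, roughly following Qwen3’s published recommendations for thinking mode.

```python
import requests

# Assumes a local OpenAI-compatible server; adjust the port and the model
# name to whatever your runtime has registered.
API_URL = "http://localhost:8080/v1/chat/completions"

payload = {
    "model": "qwen3-30b-a3b",
    "messages": [
        {"role": "user", "content": "Write a Python function that reverses a linked list."}
    ],
    # Sampling along the lines of Qwen3's recommendations for thinking mode;
    # greedy decoding reportedly makes looping and repetition worse.
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,        # non-standard OpenAI field, accepted by llama.cpp and LM Studio
    "max_tokens": 2048,
}

resp = requests.post(API_URL, json=payload, timeout=600)
print(resp.json()["choices"][0]["message"]["content"])
```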

Performance, RAM, and hardware

  • Rule of thumb cited: ~1 GB of RAM per billion params at 8-bit; 4–6 bit quantization lowers that dramatically (see the arithmetic sketch after this list).
  • 24 GB Macs struggle with 27B; 32–64 GB can run 27–30B but may crowd out other apps.
  • Reported speeds: 30B at roughly 40–70 tok/s on high-end M1/M3 Max machines with a Q4 quant; ~15 tok/s on an RTX 3060 or M4 Air; ~20 GB of VRAM is typical for 30B at Q4.
  • 16 GB Macs are advised to stick to ~12B quantized models.
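
A quick back-of-the-envelope script applying that rule of thumb; the 20% overhead factor for KV cache and runtime buffers is a guess, not a measured figure.

```python
def approx_weight_gb(params_billion: float, bits_per_weight: float,
                     overhead: float = 1.2) -> float:
    """Rough GB needed for the weights: ~1 GB per billion params at 8-bit,
    scaled by bit width, plus ~20% headroom for KV cache and buffers."""
    return params_billion * (bits_per_weight / 8) * overhead


for name, params, bits in [
    ("Qwen3-30B-A3B @ 8-bit", 30, 8),
    ("Qwen3-30B-A3B @ 4-bit", 30, 4),
    ("Qwen3-8B      @ 4-bit", 8, 4),
]:
    print(f"{name}: ~{approx_weight_gb(params, bits):.0f} GB")
# -> ~36 GB, ~18 GB, and ~5 GB respectively, which lines up with the
#    reported ~20 GB footprint for a 30B Q4 quant.
```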

MLX, Ollama, and MPS

  • MLX is praised as Apple-Silicon-optimized, faster and more efficient than running GGUF models on the GPU; it is built on Apple’s Metal stack under the hood.
  • Ollama supports Qwen3 but is reported to be slower for the 30B; users suggest llama.cpp (with recent commits) or LM Studio with the MLX backend.
  • One gotcha: the MLX setup requires the exact model name (e.g., mlx-community/Qwen3-30B-A3B-8bit) or the download will 404 (see the loading sketch below).
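
A minimal loading sketch assuming the mlx-lm Python package (`pip install mlx-lm`); only the exact repo name from the bullet above is taken from the thread, the prompt is illustrative.

```python
from mlx_lm import load, generate

# The repo name must match the mlx-community upload exactly,
# or the Hugging Face download 404s.
model, tokenizer = load("mlx-community/Qwen3-30B-A3B-8bit")

messages = [
    {"role": "user", "content": "Write a shell one-liner that counts lines of Python in this repo."}
]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)

print(generate(model, tokenizer, prompt=prompt, max_tokens=512))
```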

Local vs cloud tradeoffs & use cases

  • Many enjoy that local models are now “usable” on personal machines and improving over time, though still behind frontier models for coding and factual accuracy.
  • Reasons to run local: data sovereignty, offline use, experimentation, and avoiding detection of AI usage.
  • Others argue that for professional coding, paying for top-tier cloud models is still worth it.

Orchestration, agents, and MCP

  • Interest in a central proxy that normalizes access to multiple LLMs and logs all calls; LiteLLM, OpenRouter, Opik, and Simon Willison’s LLM tool are suggested (see the LiteLLM sketch after this list).
  • MCP + Ollama bridges are mentioned for combining local models with tool servers and IDEs.
  • Localforge’s multi-agent story: users must explicitly choose agents; routing is not automatic but is implemented via function calls and an “expert model” defined in the system prompt.
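
A sketch of the “one front door for many models” idea using the LiteLLM Python SDK rather than its proxy server (the proxy adds request logging on top of the same interface). The model identifiers and local endpoint are placeholders for whatever you actually have registered.

```python
import litellm

messages = [{"role": "user", "content": "Explain this stack trace."}]

# Same call shape whether the target is a local Ollama model...
local = litellm.completion(
    model="ollama/qwen3:30b-a3b",
    api_base="http://localhost:11434",
    messages=messages,
)

# ...or a hosted frontier model (API key read from the environment).
cloud = litellm.completion(
    model="anthropic/claude-3-5-sonnet-20240620",
    messages=messages,
)

for resp in (local, cloud):
    print(resp.choices[0].message.content[:200])
```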

“Vibe coding” discussion & meta

  • “Vibe coding” is discussed as AI-driven development where users accept code they don’t fully understand, largely via prompt iteration.
  • Some are amused or concerned about its implications for careers; others say the tools are still far from replacing developers, especially for refactoring and adhering to existing architectures.
  • A side thread debates disclosure: some see the post as stealth promotion for Localforge and argue the relationship should be clearly stated, even for open-source projects.