Qwen3-Coder-Next
Unsloth GGUFs & Quantization Choices
- Unsloth released “Dynamic” (UD) GGUFs that upcast “important” layers to higher precision, selected via a calibration dataset; non‑UD files are standard llama.cpp quants.
- Goal of dynamic quantization: smaller models with less accuracy loss. Recommended default for most hardware is UD‑Q4_K_XL; MXFP4_MOE is another option (especially on NVIDIA).
- Users asked for clearer docs on GGUF filename components and the trade‑offs between Q4/Q6/Q8; the answer was essentially that the quality/speed trade‑off is highly hardware‑dependent, so you must test empirically.
- Compared to Qwen’s own GGUFs, Unsloth’s are claimed to be better calibrated; Q8_0 is effectively the same, lower quants differ.
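As a rough aid to the Q4/Q6/Q8 trade‑off question, file size scales with bits per weight. A minimal sketch, using approximate llama.cpp block averages; the bpw figures and the 80B parameter count are illustrative assumptions, not measured values:

```python
# Rough GGUF file-size estimator. Bits-per-weight (bpw) values are
# approximate llama.cpp block averages, including quantization scales;
# real files also carry metadata and mixed-precision tensors.
BPW = {"F16": 16.0, "Q8_0": 8.5, "Q6_K": 6.56, "Q4_K_M": 4.8}

def gguf_size_gb(params_billion: float, quant: str) -> float:
    """Approximate file size in GB for a model with the given
    parameter count (in billions) at the given quant type."""
    return params_billion * 1e9 * BPW[quant] / 8 / 1e9

# Example: an 80B-parameter model at Q4_K_M lands near 48 GB,
# vs ~85 GB at Q8_0 -- the gap that decides what fits in VRAM.
```

This is why the "which quant?" question has no universal answer: the right choice is whichever bpw keeps the file (plus KV cache) inside your VRAM/RAM split at acceptable quality.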
Local Performance & Hardware Experiences
- Many successful runs on consumer hardware:
  - 7900 XTX/XT users report ~10–40 tok/s with part of the MoE offloaded to RAM, using ~60 GB+ of system memory.
  - RTX 6000 Blackwell (96GB) runs Q8_0 smoothly at >60 tok/s; 128k+ context is feasible.
  - RTX 3090 + 4090 setups get ~80 tok/s with 96k context at moderate quantization.
  - Strix Halo and DGX Spark run Q4–Q8 variants at ~25–40 tok/s; FP8 via vLLM is memory‑heavy and not superior to a good 4‑bit GGUF in practice.
- Apple Silicon: mixed results. MLX is much faster than llama.cpp but has KV‑cache/branching issues that hurt agentic workflows; some find Qwen3‑Next “not well supported” on Macs, while others get high tok/s with MLX builds in LM Studio.
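The “part of MoE offloaded to RAM” setups above typically use llama.cpp's expert‑offload flags. A sketch of such an invocation; the repo tag, layer count, and context size are illustrative guesses, and flag availability depends on your llama.cpp build:

```shell
# Serve a dynamic-quant GGUF with dense layers on the GPU while keeping
# some MoE expert tensors in system RAM (illustrative values throughout).
llama-server \
  -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
  -ngl 99 \
  --n-cpu-moe 24 \
  -c 65536 \
  --jinja
```

Here `-ngl 99` places all layers on the GPU by default, while `--n-cpu-moe 24` keeps the expert tensors of the first 24 layers in RAM (older builds achieve the same with `-ot` and a regex mapping `ffn_*_exps` tensors to CPU); `--jinja` enables the model's chat template, which matters for tool calling.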
Real‑World Capability vs Frontier Models
- Paper claims: near‑Sonnet‑4.5 SWE-Bench Pro performance with only ~3B active parameters.
- User tests (often on Q2/Q4) generally find it strong “for a local model,” but not at Sonnet‑4.5 / Opus level; some compare it closer to Haiku or older Sonnet 3.7/4.0. Several note looping/“thinking” stalls.
- Consensus: higher‑precision quants (Q6–Q8) are needed for a fair comparison; low‑bit quants significantly degrade quality.
Agentic Coding, Tools & Context
- Strong interest in using Qwen3‑Coder‑Next as a fast “junior dev” subagent, with frontier models handling planning/complex reasoning.
- People report good results in OpenCode, Codex CLI, and Claude Code via local backends, but tool‑calling can be brittle:
  - Some small/older models fail with XML‑based tool schemas or loop on simple shell commands.
  - Workarounds include JSON‑tool‑aware custom CLIs, proxies, tuning repeat penalties, and temperature=0.
- Context remains a key bottleneck for real projects: even with 100k–256k support, coding agents can rapidly exhaust windows when scanning multiple files. Subagents with separate contexts are suggested as mitigation.
Local vs Cloud Economics & Future Trajectory
- Local inference is attractive for high‑volume, latency‑tolerant coding agents: API retries and tool‑call failures can inflate cloud costs by 40–60%.
- Counterpoint: cheap Chinese APIs (DeepSeek, GLM, Kimi) plus high tps from providers may undercut home hardware when utilization is high.
- Broader debate:
  - One side expects open/local to always lag frontier significantly due to data and training cost moats; huge models (possibly 1–2T params) are seen as inherently superior.
  - Others argue “good enough” small models will win for many tasks, as with consumer cars vs supercars; future architectural advances and distillation could shift the size–capability frontier.
- Concern that hardware makers favor datacenters over consumers, risking a future where powerful compute is mostly rented.
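The 40–60% inflation figure above is consistent with a simple geometric‑retry model, where each failed call is re‑sent until it succeeds; the failure rates below are illustrative, not measured:

```python
def retry_multiplier(failure_rate: float) -> float:
    """Expected attempts per successful call when each attempt fails
    independently with probability failure_rate and is retried until
    it succeeds (mean of a geometric distribution)."""
    if not 0.0 <= failure_rate < 1.0:
        raise ValueError("failure_rate must be in [0, 1)")
    return 1.0 / (1.0 - failure_rate)

def effective_cost(base_cost: float, failure_rate: float) -> float:
    """Token spend after retries, relative to a failure-free run."""
    return base_cost * retry_multiplier(failure_rate)

# ~29% of calls failing inflates spend by ~40% (1/0.71),
# and 37.5% by exactly 60% (1/0.625 = 1.6).
```

The same multiplier applies to latency, which is why flat‑rate local inference looks attractive for retry‑heavy agent loops.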
Anthropic / Claude Code & Competition Concerns
- Several participants cancelled their subscriptions after Anthropic blocked use of individual plans (Max/Pro) with third‑party agents like OpenCode, or after bans for wrapping Claude Code in custom interfaces.
- Defenders note:
  - Subscription plans were marketed for Anthropic’s own tools, not as a general API; they are oversubscribed and subsidized under assumed usage patterns.
  - Heavy third‑party agent use breaks those assumptions; users are expected to pay API rates for that.
- Critics call this anti‑competitive: cheap tokens are effectively tied to Claude Code, making it harder for independent agents to compete on equal pricing.
- Broader anxiety: dependence on a few frontier providers, possible future “enshittification,” and risk that access can be revoked arbitrarily. Many see open/local models (including large ones rented on generic cloud VMs) as essential for long‑term autonomy.
Misc Technical & Conceptual Points
- Clarifications:
  - The SWE‑Bench “agent turns” chart is just a boxplot of turn distributions per task, not error bars.
  - Context management (truncation, cleanup) is always outside the model; Qwen itself just consumes text.
- Some worry about CCP‑aligned censorship; replies suggest open weights can be fine‑tuned or “unaligned” if desired.
- Users request better standardized benchmarks for “local” scenarios (time‑to‑first‑token, tps, memory, context) on standard hardware classes, and clearer terminology distinguishing true self‑hosted vs LAN vs hosted “local” tools.
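The clarification above that context management lives entirely outside the model can be illustrated by a trivial truncation helper of the kind agent frameworks implement; character counts stand in for real token counting here:

```python
def fit_to_window(messages: list[str], budget: int) -> list[str]:
    """Keep the most recent messages that fit a context budget.

    Uses character counts as a crude proxy for tokens; a real agent
    would count tokens with the model's tokenizer. The model never
    sees this logic -- it only consumes the resulting text."""
    kept: list[str] = []
    used = 0
    for msg in reversed(messages):           # newest first
        if used + len(msg) > budget:
            break                            # older history is dropped
        kept.append(msg)
        used += len(msg)
    return list(reversed(kept))              # restore chronological order
```

Subagents with separate contexts, as suggested earlier, are essentially a way to give each task its own `budget` instead of sharing one window.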