Qwen3-Coder-Next

Unsloth GGUFs & Quantization Choices

  • Unsloth released “Dynamic” (UD) GGUFs that upcast “important” layers to higher precision using a calibration dataset; non‑UD are standard llama.cpp quants.
  • Goal of dynamic quantization: smaller models with less accuracy loss. Recommended default for most hardware is UD‑Q4_K_XL; MXFP4_MOE is another option (especially on NVIDIA).
  • Users asked for clearer docs on filename components and trade‑offs between Q4/Q6/Q8; answer was essentially: quality vs speed is highly hardware‑dependent, so you must empirically test.
  • Compared to Qwen’s own GGUFs, Unsloth’s are claimed to be better calibrated; Q8_0 is effectively the same, lower quants differ.
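For a rough feel for the Q4/Q6/Q8 trade‑off, file size scales with average bits per weight. A back‑of‑envelope sketch (the bits‑per‑weight figures are approximate averages for llama.cpp K‑quants; real files vary because different tensors get different quant types, especially in UD mixes):

```python
# Approximate average bits-per-weight for common llama.cpp quant types.
# These are rough figures, not exact values for any particular GGUF.
APPROX_BPW = {"Q4_K_M": 4.8, "Q6_K": 6.6, "Q8_0": 8.5}

def gguf_size_gb(params_b: float, quant: str) -> float:
    """Estimated file size in GB for a model with params_b billion weights:
    size ≈ params × bits-per-weight / 8."""
    bits = params_b * 1e9 * APPROX_BPW[quant]
    return bits / 8 / 1e9

# Hypothetical example: an 80B-total-parameter MoE at each quant level.
for q in APPROX_BPW:
    print(f"{q}: ~{gguf_size_gb(80, q):.0f} GB")
```

This is why the quality‑vs‑speed answer is hardware‑dependent: the quant level determines whether the weights fit in VRAM at all, which matters more than per‑weight fidelity.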

Local Performance & Hardware Experiences

  • Many successful runs on consumer hardware:
    • AMD 7900 XTX/XT users report ~10–40 tok/s with part of the MoE weights offloaded to RAM, using ~60GB+ of system memory.
    • RTX 6000 Blackwell (96GB) runs Q8_0 smoothly at >60 tok/s, 128k+ context feasible.
    • Mixed RTX 3090 + 4090 setups reach 80 tok/s with 96k context at moderate quantization.
    • Strix Halo and DGX Spark run Q4–Q8 variants at ~25–40 tok/s; FP8 via vLLM is memory‑heavy and not superior to good 4‑bit GGUF in practice.
  • Apple Silicon: mixed results. MLX is much faster than llama.cpp but has KV‑cache/branching issues that hurt agentic workflows; some find Qwen3‑Next “not well supported” on Macs, while others get high tok/s with MLX builds in LM Studio.
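The offload‑to‑RAM numbers above follow from simple accounting: whatever share of the (mostly expert) weights doesn’t fit in VRAM spills to system memory. A hedged sketch with hypothetical sizes (real offload happens per‑tensor, e.g. via llama.cpp’s `--override-tensor`, not as a freely divisible split):

```python
def split_moe(model_gb: float, kv_cache_gb: float, vram_gb: float):
    """Return (gb_on_gpu, gb_in_ram) when weights that don't fit in VRAM
    spill to system RAM. Simplification: treats weights as freely divisible."""
    budget = vram_gb - kv_cache_gb          # VRAM left for weights
    on_gpu = min(model_gb, max(budget, 0.0))
    return on_gpu, model_gb - on_gpu

# Hypothetical: a ~45 GB Q4 model, 8 GB KV cache, a 24 GB card (e.g. 7900 XTX)
gpu, ram = split_moe(45.0, 8.0, 24.0)
print(f"GPU: {gpu} GB, RAM: {ram} GB")
```

The RAM‑resident share is what drags throughput from GPU‑class speeds down to the ~10–40 tok/s range reported on 24GB cards.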

Real‑World Capability vs Frontier Models

  • Paper claims: near‑Sonnet‑4.5 SWE‑Bench Pro performance with only ~3B active parameters.
  • User tests (often on Q2/Q4) generally find it strong “for a local model,” but not at Sonnet‑4.5 / Opus level; some compare it closer to Haiku or older Sonnet 3.7/4.0. Several note looping/“thinking” stalls.
  • Consensus: higher‑precision quants (Q6–Q8) are needed for a fair comparison; low‑bit quants significantly degrade quality.

Agentic Coding, Tools & Context

  • Strong interest in using Qwen3‑Coder‑Next as a fast “junior dev” subagent, with frontier models handling planning/complex reasoning.
  • People report good results in OpenCode, Codex CLI, and Claude Code via local backends, but tool‑calling can be brittle:
    • Some small/older models fail with XML‑based tool schemas or loop on simple shell commands.
    • Workarounds include JSON‑tool‑aware custom CLIs, proxies, tuning repeat penalties, and temperature=0.
  • Context remains a key bottleneck for real projects: even with 100k–256k support, coding agents can rapidly exhaust windows when scanning multiple files. Subagents with separate contexts are suggested as mitigation.
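The tool‑calling workarounds in the thread mostly boil down to: request tool calls as JSON, validate strictly, and re‑prompt deterministically on failure. A minimal, hypothetical sketch of the parsing side (the function name and retry convention are illustrative, not any particular CLI’s API):

```python
import json

def extract_tool_call(text: str):
    """Scan model output for the first JSON object that looks like a tool
    call ({"name": ..., "arguments": ...}). Returning None is the caller's
    signal to retry deterministically (temperature=0, higher repeat penalty)."""
    dec = json.JSONDecoder()
    i = text.find("{")
    while i != -1:
        try:
            obj, _end = dec.raw_decode(text, i)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict) and "name" in obj and "arguments" in obj:
            return obj
        i = text.find("{", i + 1)
    return None
```

Using `raw_decode` rather than a regex handles nested JSON arguments correctly, which is exactly where small models’ slightly malformed tool calls tend to trip brittle parsers.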

Local vs Cloud Economics & Future Trajectory

  • Local inference is attractive for high‑volume, latency‑tolerant coding agents: API retries and tool‑call failures can inflate cloud costs by 40–60%.
  • Counterpoint: cheap Chinese APIs (DeepSeek, GLM, Kimi) plus high throughput from hosted providers may undercut home hardware when utilization is high.
  • Broader debate:
    • One side expects open/local to always lag frontier significantly due to data and training cost moats; huge models (possibly 1–2T params) are seen as inherently superior.
    • Others argue “good enough” small models will win for many tasks, as with consumer cars vs supercars; future architectural advances and distillation could shift the size–capability frontier.
    • Concern that hardware makers favor datacenters over consumers, risking a future where powerful compute is mostly rented.
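The economics argument is just utilization arithmetic: whether home hardware wins depends on token volume and effective API price. A sketch with made‑up numbers (all prices, overheads, and hardware figures below are hypothetical):

```python
def monthly_cost_api(tokens_m: float, price_per_m: float,
                     retry_overhead: float) -> float:
    """API cost for tokens_m million tokens, inflated by retry/tool-failure
    overhead (the thread cites 40-60% inflation)."""
    return tokens_m * price_per_m * (1.0 + retry_overhead)

def monthly_cost_local(hw_price: float, amort_months: int, power_w: float,
                       hours: float, kwh_price: float) -> float:
    """Amortized hardware cost plus electricity."""
    return hw_price / amort_months + power_w / 1000 * hours * kwh_price

# Hypothetical: 500M tokens/month at $0.40/M with 50% retry overhead,
# vs a $3000 rig amortized over 36 months running 300 h at 400 W.
api = monthly_cost_api(tokens_m=500, price_per_m=0.40, retry_overhead=0.5)
local = monthly_cost_local(hw_price=3000, amort_months=36,
                           power_w=400, hours=300, kwh_price=0.30)
print(f"API: ${api:.0f}/mo, local: ${local:.0f}/mo")
```

At low utilization the amortization term dominates and cheap APIs win; at sustained agent‑level volume the comparison flips, which is the crux of the debate above.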

Anthropic / Claude Code & Competition Concerns

  • Several participants cancelled Claude Code after Anthropic blocked use of individual subscription plans (Max/Pro) via third‑party agents like OpenCode, or after bans for wrapping Claude Code in custom interfaces.
  • Defenders note:
    • Subscription plans were marketed for Anthropic’s own tools, not as a general API; they are oversubscribed and subsidized under assumed usage patterns.
    • Heavy third‑party agent use breaks those assumptions; users are expected to pay API rates for that.
  • Critics call this anti‑competitive: cheap tokens are effectively tied to Claude Code, making it harder for independent agents to compete on equal pricing.
  • Broader anxiety: dependence on a few frontier providers, possible future “enshittification,” and risk that access can be revoked arbitrarily. Many see open/local models (including large ones rented on generic cloud VMs) as essential for long‑term autonomy.

Misc Technical & Conceptual Points

  • Clarifications:
    • SWE‑Bench “agent turns” chart is just a boxplot of turn distributions per task, not error bars.
    • Context management (truncation, cleanup) is always outside the model; Qwen itself just consumes text.
  • Some worry about CCP‑aligned censorship; replies suggest open weights can be fine‑tuned or “unaligned” if desired.
  • Users request better‑standardized benchmarks for “local” scenarios (time‑to‑first‑token, tok/s, memory use, context length) on common hardware classes, and clearer terminology distinguishing truly self‑hosted vs LAN vs hosted “local” tools.
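The two requested metrics have precise definitions: time‑to‑first‑token is wall‑clock time from request to the first emitted token, and decode tok/s counts tokens after the first over the remaining time (so prefill doesn’t skew it). A sketch of measuring both around a hypothetical streaming backend:

```python
import time

def measure(stream):
    """stream: iterable yielding tokens (e.g. from a llama.cpp server).
    Returns (ttft_seconds, decode_tokens_per_second)."""
    t0 = time.perf_counter()
    ttft, t_first, t_last, n = None, None, None, 0
    for _tok in stream:
        t_last = time.perf_counter()
        if ttft is None:
            ttft = t_last - t0            # time-to-first-token
            t_first = t_last
        n += 1
    if n < 2 or t_last <= t_first:
        return ttft, 0.0
    return ttft, (n - 1) / (t_last - t_first)  # decode speed excludes prefill

def fake_stream():  # stand-in for a real token stream
    for _ in range(5):
        time.sleep(0.01)
        yield "tok"

print(measure(fake_stream()))
```

Reporting these two numbers separately (plus memory and context length) on named hardware classes would make the scattered tok/s claims in threads like this one directly comparable.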