Qwen3-Coder-Next

Unsloth GGUFs & Quantization Choices

  • Unsloth released “Dynamic” (UD) GGUFs that upcast “important” layers to higher precision using a calibration dataset; non‑UD are standard llama.cpp quants.
  • Goal of dynamic quantization: smaller models with less accuracy loss. Recommended default for most hardware is UD‑Q4_K_XL; MXFP4_MOE is another option (especially on NVIDIA).
  • Users asked for clearer docs on filename components and trade‑offs between Q4/Q6/Q8; answer was essentially: quality vs speed is highly hardware‑dependent, so you must empirically test.
  • Compared to Qwen’s own GGUFs, Unsloth’s are claimed to be better calibrated; Q8_0 is effectively the same, lower quants differ.
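For a rough feel for the Q4/Q6/Q8 trade‑off, file size scales with average bits per weight. A back‑of‑envelope sketch (the bits‑per‑weight figures are approximate averages for llama.cpp K‑quants; real files vary because different tensors get different quant types, especially in UD mixes):

```python
# Approximate average bits-per-weight for common llama.cpp quant types.
# These are rough figures, not exact values for any particular GGUF.
APPROX_BPW = {"Q4_K_M": 4.8, "Q6_K": 6.6, "Q8_0": 8.5}

def gguf_size_gb(params_b: float, quant: str) -> float:
    """Estimated file size in GB for a model with params_b billion weights:
    size ≈ params × bits-per-weight / 8."""
    bits = params_b * 1e9 * APPROX_BPW[quant]
    return bits / 8 / 1e9

# Hypothetical example: an 80B-total-parameter MoE at each quant level.
for q in APPROX_BPW:
    print(f"{q}: ~{gguf_size_gb(80, q):.0f} GB")
```

This is why the quality‑vs‑speed answer is hardware‑dependent: the quant level determines whether the weights fit in VRAM at all, which matters more than per‑weight fidelity.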

Local Performance & Hardware Experiences

  • Many successful runs on consumer hardware:
    • AMD 7900 XTX/XT users report ~10–40 tok/s with part of the MoE weights offloaded to RAM, using ~60GB+ of system memory.
    • RTX 6000 Blackwell (96GB) runs Q8_0 smoothly at >60 tok/s, 128k+ context feasible.
    • Mixed RTX 3090 + 4090 setups reach 80 tok/s with 96k context at moderate quantization.
    • Strix Halo and DGX Spark run Q4–Q8 variants at ~25–40 tok/s; FP8 via vLLM is memory‑heavy and not superior to good 4‑bit GGUF in practice.
  • Apple Silicon: mixed results. MLX is much faster than llama.cpp but has KV‑cache/branching issues that hurt agentic workflows; some find Qwen3‑Next “not well supported” on Macs, while others get high tok/s with MLX builds in LM Studio.
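The offload‑to‑RAM numbers above follow from simple accounting: whatever share of the (mostly expert) weights doesn’t fit in VRAM spills to system memory. A hedged sketch with hypothetical sizes (real offload happens per‑tensor, e.g. via llama.cpp’s `--override-tensor`, not as a freely divisible split):

```python
def split_moe(model_gb: float, kv_cache_gb: float, vram_gb: float):
    """Return (gb_on_gpu, gb_in_ram) when weights that don't fit in VRAM
    spill to system RAM. Simplification: treats weights as freely divisible."""
    budget = vram_gb - kv_cache_gb          # VRAM left for weights
    on_gpu = min(model_gb, max(budget, 0.0))
    return on_gpu, model_gb - on_gpu

# Hypothetical: a ~45 GB Q4 model, 8 GB KV cache, a 24 GB card (e.g. 7900 XTX)
gpu, ram = split_moe(45.0, 8.0, 24.0)
print(f"GPU: {gpu} GB, RAM: {ram} GB")
```

The RAM‑resident share is what drags throughput from GPU‑class speeds down to the ~10–40 tok/s range reported on 24GB cards.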

Real‑World Capability vs Frontier Models

  • Paper claims: near‑Sonnet‑4.5 SWE‑Bench Pro performance with only ~3B active parameters.
  • User tests (often on Q2/Q4) generally find it strong “for a local model,” but not at Sonnet‑4.5 / Opus level; some compare it closer to Haiku or older Sonnet 3.7/4.0. Several note looping/“thinking” stalls.
  • Consensus: higher‑precision quants (Q6–Q8) are needed for a fair comparison; low‑bit quants significantly degrade quality.

Agentic Coding, Tools & Context

  • Strong interest in using Qwen3‑Coder‑Next as a fast “junior dev” subagent, with frontier models handling planning/complex reasoning.
  • People report good results in OpenCode, Codex CLI, and Claude Code via local backends, but tool‑calling can be brittle:
    • Some small/older models fail with XML‑based tool schemas or loop on simple shell commands.
    • Workarounds include JSON‑tool‑aware custom CLIs, proxies, tuning repeat penalties, and temperature=0.
  • Context remains a key bottleneck for real projects: even with 100k–256k support, coding agents can rapidly exhaust windows when scanning multiple files. Subagents with separate contexts are suggested as mitigation.
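The tool‑calling workarounds in the thread mostly boil down to: request tool calls as JSON, validate strictly, and re‑prompt deterministically on failure. A minimal, hypothetical sketch of the parsing side (the function name and retry convention are illustrative, not any particular CLI’s API):

```python
import json

def extract_tool_call(text: str):
    """Scan model output for the first JSON object that looks like a tool
    call ({"name": ..., "arguments": ...}). Returning None is the caller's
    signal to retry deterministically (temperature=0, higher repeat penalty)."""
    dec = json.JSONDecoder()
    i = text.find("{")
    while i != -1:
        try:
            obj, _end = dec.raw_decode(text, i)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict) and "name" in obj and "arguments" in obj:
            return obj
        i = text.find("{", i + 1)
    return None
```

Using `raw_decode` rather than a regex handles nested JSON arguments correctly, which is exactly where small models’ slightly malformed tool calls tend to trip brittle parsers.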

Local vs Cloud Economics & Future Trajectory

  • Local inference is attractive for high‑volume, latency‑tolerant coding agents: API retries and tool‑call failures can inflate cloud costs by 40–60%.
  • Counterpoint: cheap Chinese APIs (DeepSeek, GLM, Kimi) plus high throughput from hosted providers may undercut home hardware when utilization is high.
  • Broader debate:
    • One side expects open/local to always lag frontier significantly due to data and training cost moats; huge models (possibly 1–2T params) are seen as inherently superior.
    • Others argue “good enough” small models will win for many tasks, as with consumer cars vs supercars; future architectural advances and distillation could shift the size–capability frontier.
    • Concern that hardware makers favor datacenters over consumers, risking a future where powerful compute is mostly rented.
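The economics argument is just utilization arithmetic: whether home hardware wins depends on token volume and effective API price. A sketch with made‑up numbers (all prices, overheads, and hardware figures below are hypothetical):

```python
def monthly_cost_api(tokens_m: float, price_per_m: float,
                     retry_overhead: float) -> float:
    """API cost for tokens_m million tokens, inflated by retry/tool-failure
    overhead (the thread cites 40-60% inflation)."""
    return tokens_m * price_per_m * (1.0 + retry_overhead)

def monthly_cost_local(hw_price: float, amort_months: int, power_w: float,
                       hours: float, kwh_price: float) -> float:
    """Amortized hardware cost plus electricity."""
    return hw_price / amort_months + power_w / 1000 * hours * kwh_price

# Hypothetical: 500M tokens/month at $0.40/M with 50% retry overhead,
# vs a $3000 rig amortized over 36 months running 300 h at 400 W.
api = monthly_cost_api(tokens_m=500, price_per_m=0.40, retry_overhead=0.5)
local = monthly_cost_local(hw_price=3000, amort_months=36,
                           power_w=400, hours=300, kwh_price=0.30)
print(f"API: ${api:.0f}/mo, local: ${local:.0f}/mo")
```

At low utilization the amortization term dominates and cheap APIs win; at sustained agent‑level volume the comparison flips, which is the crux of the debate above.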

Anthropic / Claude Code & Competition Concerns

  • Several participants cancelled Claude Code after Anthropic blocked use of individual subscription plans (Max/Pro) via third‑party agents like OpenCode, or after bans for wrapping Claude Code in custom interfaces.
  • Defenders note:
    • Subscription plans were marketed for Anthropic’s own tools, not as a general API; they are oversubscribed and subsidized under assumed usage patterns.
    • Heavy third‑party agent use breaks those assumptions; users are expected to pay API rates for that.
  • Critics call this anti‑competitive: cheap tokens are effectively tied to Claude Code, making it harder for independent agents to compete on equal pricing.
  • Broader anxiety: dependence on a few frontier providers, possible future “enshittification,” and risk that access can be revoked arbitrarily. Many see open/local models (including large ones rented on generic cloud VMs) as essential for long‑term autonomy.

Misc Technical & Conceptual Points

  • Clarifications:
    • SWE‑Bench “agent turns” chart is just a boxplot of turn distributions per task, not error bars.
    • Context management (truncation, cleanup) is always outside the model; Qwen itself just consumes text.
  • Some worry about CCP‑aligned censorship; replies suggest open weights can be fine‑tuned or “unaligned” if desired.
  • Users request better‑standardized benchmarks for “local” scenarios (time‑to‑first‑token, tok/s, memory use, context length) on common hardware classes, and clearer terminology distinguishing truly self‑hosted vs LAN vs hosted “local” tools.
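The two requested metrics have precise definitions: time‑to‑first‑token is wall‑clock time from request to the first emitted token, and decode tok/s counts tokens after the first over the remaining time (so prefill doesn’t skew it). A sketch of measuring both around a hypothetical streaming backend:

```python
import time

def measure(stream):
    """stream: iterable yielding tokens (e.g. from a llama.cpp server).
    Returns (ttft_seconds, decode_tokens_per_second)."""
    t0 = time.perf_counter()
    ttft, t_first, t_last, n = None, None, None, 0
    for _tok in stream:
        t_last = time.perf_counter()
        if ttft is None:
            ttft = t_last - t0            # time-to-first-token
            t_first = t_last
        n += 1
    if n < 2 or t_last <= t_first:
        return ttft, 0.0
    return ttft, (n - 1) / (t_last - t_first)  # decode speed excludes prefill

def fake_stream():  # stand-in for a real token stream
    for _ in range(5):
        time.sleep(0.01)
        yield "tok"

print(measure(fake_stream()))
```

Reporting these two numbers separately (plus memory and context length) on named hardware classes would make the scattered tok/s claims in threads like this one directly comparable.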