Qwen3-Next

Architecture, Linear Attention, and MTP

  • Discussion highlights Qwen3‑Next’s hybrid architecture (Gated DeltaNet linear attention interleaved with gated full attention, plus a high‑sparsity MoE) as a genuine departure from “standard” transformer stacks.
  • Multi‑Token Prediction (MTP) is seen as a key innovation: predicts multiple future tokens with a shared head, avoiding huge extra unembedding matrices.
  • Several comments unpack how MTP enables self‑speculative decoding: generate token n with the full model, draft tokens n+1…n+k cheaply with the MTP head, then validate; if the guesses are right, you effectively batch ahead “for free.”
  • Some confusion around speculative‑decoding mechanics is resolved: “checking” the drafted tokens still costs a forward pass, but that pass can be batched and reused, so correct guesses amortize its cost. MTP as shipped mainly helps inference, not pretraining.
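The draft‑then‑verify loop described above can be sketched in a few lines. Everything here is a toy stand‑in, not Qwen3‑Next’s actual API: the “full model” and “MTP head” are hypothetical functions operating on integer tokens, with the draft head deliberately wrong at one position to show the accept/reject behavior.

```python
# Toy sketch of self-speculative decoding with an MTP-style draft head.
# All functions are hypothetical stand-ins, not Qwen3-Next's real interface.

def full_model_next(tokens):
    """Stand-in for the full model: next token = (last token + 1) % 100."""
    return (tokens[-1] + 1) % 100

def mtp_draft(tokens, k):
    """Stand-in MTP head: drafts k future tokens cheaply.
    It agrees with the full model except after token 41 (a deliberate miss)."""
    out, cur = [], list(tokens)
    for _ in range(k):
        guess = 0 if cur[-1] == 41 else (cur[-1] + 1) % 100
        out.append(guess)
        cur.append(guess)
    return out

def speculative_step(tokens, k):
    """One decode step: draft k tokens, verify them against the full model
    (a single batched forward pass in practice), and accept the longest
    correct prefix; the verify pass also yields the corrected/bonus token."""
    draft = mtp_draft(tokens, k)
    accepted, ctx = [], list(tokens)
    for guess in draft:
        target = full_model_next(ctx)      # verification (batched in reality)
        if guess != target:
            accepted.append(target)        # first miss: keep the true token
            return accepted
        accepted.append(guess)
        ctx.append(guess)
    accepted.append(full_model_next(ctx))  # all matched: "bonus" token, too
    return accepted
```

Starting from token 40, the draft misfires on its second guess, so the step yields two tokens (one accepted, one corrected); starting from token 10 with k=3, all drafts match and the step emits four tokens for one verification pass. That asymmetry is exactly why correct guesses feel “free.”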

Quality, Steerability, and Overfitting

  • One thread claims Qwen models feel overfit and “stubborn”: great at known patterns (standard math/coding tasks) but hard to steer into alternative reasoning modes or code understanding/reversal.
  • Compared to top closed models, people report weaker out‑of‑distribution generalization and steerability, with some users also seeing odd, almost “fraying” dialogue and hallucinations.
  • ASCII SpongeBob is used as a memorization probe; larger Qwen coder variants often reproduce a specific ASCII drawing found on the web verbatim, suggesting rote recall. Some argue this indicates strong learning; others see memorization winning out over generalization.

MoE Efficiency, VRAM, and Local Running

  • Enthusiasm around MoE: 80B total parameters with ~3B active per token, often running as fast as or faster than mid‑size dense models.
  • Extensive debate on VRAM requirements: rule‑of‑thumb parameter→memory conversions, impact of 4‑bit quantization, and how much can be offloaded to CPU RAM.
  • Disagreement over practical CPU/GPU swapping of experts: some report usable setups with partial offload; others point to massive bandwidth penalties and 5× slower generation when experts run on CPU.
  • Users confirm fully offline use is possible; estimates range from ~50–200GB RAM (or mixed VRAM+RAM) for comfortable runs.
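The rule‑of‑thumb conversions debated above reduce to simple arithmetic: weight memory is roughly parameter count times bits per weight, divided by eight. A minimal sketch (the helper name is made up for illustration; this counts weights only and ignores KV cache, activations, and runtime overhead, which is why real‑world estimates run higher):

```python
def weight_gib(params_billions, bits_per_weight):
    """Approximate weight-only memory footprint in GiB:
    params * (bits / 8) bytes, converted to GiB (2**30 bytes)."""
    return params_billions * 1e9 * bits_per_weight / 8 / 2**30

# Qwen3-Next's 80B total weights, weights only:
#   16-bit: ~149 GiB    8-bit: ~74.5 GiB    4-bit: ~37.3 GiB
# Only ~3B params are *active* per token, which keeps compute low,
# but every expert's weights still has to sit somewhere (VRAM or RAM),
# hence the debate over CPU offload and its bandwidth penalty.
```

The ~37 GiB 4‑bit figure explains why the thread converges on mixed VRAM+RAM setups: the full expert set rarely fits on a single consumer GPU, but system RAM can absorb the overflow at a speed cost.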

Context Length and Long-Context Behavior

  • Qwen3‑Next advertises 262k native context and up to 1M with RoPE scaling (YaRN), but Qwen’s hosted chat currently exposes only 262k, so some stick to earlier 1M‑context models.
  • Several argue that nominal context length ≠ reliable retrieval: many frontier models degrade badly when context is saturated, though others report good multi‑hundred‑kilotoken workflows (e.g., entire repos as XML).
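The RoPE‑scaling mechanism behind the 262k→1M extension can be illustrated with plain position interpolation, the baseline that YaRN refines (YaRN adds a per‑frequency interpolation ramp and an attention‑temperature adjustment on top). A simplified sketch, with a hypothetical function name; this is the PI baseline, not YaRN itself:

```python
def rope_inv_freqs(head_dim, base=10000.0, scale=1.0):
    """Inverse frequencies for RoPE rotary embeddings.
    Plain position interpolation: dividing all frequencies by `scale`
    stretches the trained position range by that factor, so a model
    trained to N tokens can address roughly N * scale positions."""
    return [1.0 / (base ** (2 * i / head_dim)) / scale
            for i in range(head_dim // 2)]

# Stretching 262,144 native tokens toward 1M needs
# scale ≈ 1_000_000 / 262_144 ≈ 3.8; compressing positions this way
# trades some fine-grained positional resolution for reach, which is
# one reason nominal length and reliable retrieval can diverge.
```

The trade‑off in the last comment is the technical core of the thread’s skepticism: a longer addressable window does not guarantee the model attends accurately across all of it.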

Benchmarks, Comparisons, and Skepticism

  • The blog claims Qwen3‑Next‑80B matches the larger 235B MoE on many tasks and outperforms it on ultra‑long‑context; some users testing it disagree, finding it clearly weaker than 235B and only on par with GPT‑OSS‑20B on one coding benchmark.
  • Concerns are raised about “benchmaxxing” in 2025; some want to see results on independent closed benchmarks and broad suites before trusting the claims.
  • Others report strong subjective impressions: chat quality close to the 235B model but noticeably faster, and very competitive pricing on some hosting platforms.

MoE vs Dense and Ecosystem Direction

  • Commenters frame Qwen3‑Next as evidence that large sparse MoE is now decisively better than older 70B+ dense models on a speed–quality basis.
  • There is debate over how novel Qwen’s contribution really is, given that state‑of‑the‑art closed models have been MoE for some time; nonetheless, many see Qwen as pushing open‑weights MoE forward more aggressively than previous releases.

Compute Demand and Jevons-Style Arguments

  • Some speculate that 10× efficiency gains could undercut the business case for massive new datacenters and cloud LLM APIs.
  • Others counter with Jevons‑style reasoning: cheaper, faster inference will enable more demanding models, higher reasoning budgets, continuous agents, and pervasive embedding in software, driving more total compute, not less.
  • There’s disagreement on current AI penetration in domains like customer support and software engineering, but broad consensus that much potential demand remains untapped.

Miscellaneous Notes

  • Newcomers express confusion over text vs image variants; commenters clarify that Qwen3‑Next is text‑only, separate from Qwen Image models.
  • Some users report “strange hallucinations” and unstable behavior; others praise the model’s long‑context performance and Alibaba’s steady cadence of strong open releases.
  • Minor grumbling about the “Next” naming convention and broken content loading on the Qwen website.