Qwen3-Next

Architecture, Linear Attention, and MTP

  • Discussion highlights Qwen3‑Next’s hybrid architecture (Gated DeltaNet linear attention interleaved with gated full attention, plus a high‑sparsity MoE) as a genuine departure from “standard” transformer stacks.
  • Multi‑Token Prediction (MTP) is seen as a key innovation: predicts multiple future tokens with a shared head, avoiding huge extra unembedding matrices.
  • Several comments unpack how MTP enables self‑speculative decoding: generate token n with the full model, draft tokens n+1…n+k cheaply with the MTP head, then validate; if the guesses are right, you effectively batch ahead “for free.”
  • Some confusion around speculative‑decoding mechanics is resolved: “checking” the drafted tokens still costs a forward pass, but that pass can be batched and reused, so correct guesses amortize its cost. MTP as shipped mainly helps inference, not pretraining.
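The draft‑then‑verify loop described above can be sketched in a few lines. Everything here is a toy stand‑in, not Qwen3‑Next’s actual API: the “full model” and “MTP head” are hypothetical functions operating on integer tokens, with the draft head deliberately wrong at one position to show the accept/reject behavior.

```python
# Toy sketch of self-speculative decoding with an MTP-style draft head.
# All functions are hypothetical stand-ins, not Qwen3-Next's real interface.

def full_model_next(tokens):
    """Stand-in for the full model: next token = (last token + 1) % 100."""
    return (tokens[-1] + 1) % 100

def mtp_draft(tokens, k):
    """Stand-in MTP head: drafts k future tokens cheaply.
    It agrees with the full model except after token 41 (a deliberate miss)."""
    out, cur = [], list(tokens)
    for _ in range(k):
        guess = 0 if cur[-1] == 41 else (cur[-1] + 1) % 100
        out.append(guess)
        cur.append(guess)
    return out

def speculative_step(tokens, k):
    """One decode step: draft k tokens, verify them against the full model
    (a single batched forward pass in practice), and accept the longest
    correct prefix; the verify pass also yields the corrected/bonus token."""
    draft = mtp_draft(tokens, k)
    accepted, ctx = [], list(tokens)
    for guess in draft:
        target = full_model_next(ctx)      # verification (batched in reality)
        if guess != target:
            accepted.append(target)        # first miss: keep the true token
            return accepted
        accepted.append(guess)
        ctx.append(guess)
    accepted.append(full_model_next(ctx))  # all matched: "bonus" token, too
    return accepted
```

Starting from token 40, the draft misfires on its second guess, so the step yields two tokens (one accepted, one corrected); starting from token 10 with k=3, all drafts match and the step emits four tokens for one verification pass. That asymmetry is exactly why correct guesses feel “free.”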

Quality, Steerability, and Overfitting

  • One thread claims Qwen models feel overfit and “stubborn”: great at known patterns (standard math/coding tasks) but hard to steer into alternative reasoning modes or code understanding/reversal.
  • Compared to top closed models, people report weaker out‑of‑distribution generalization and steerability, with some users also seeing odd, almost “fraying” dialogue and hallucinations.
  • ASCII SpongeBob is used as a memorization probe; larger Qwen coder variants often reproduce a specific ASCII drawing found on the web verbatim, suggesting rote recall. Some argue this indicates strong learning; others see memorization winning out over generalization.

MoE Efficiency, VRAM, and Local Running

  • Enthusiasm around MoE: 80B total parameters with ~3B active per token, often running as fast as or faster than mid‑size dense models.
  • Extensive debate on VRAM requirements: rule‑of‑thumb parameter→memory conversions, impact of 4‑bit quantization, and how much can be offloaded to CPU RAM.
  • Disagreement over practical CPU/GPU swapping of experts: some report usable setups with partial offload; others point to massive bandwidth penalties and 5× slower generation when experts run on CPU.
  • Users confirm fully offline use is possible; estimates range from ~50–200GB RAM (or mixed VRAM+RAM) for comfortable runs.
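The rule‑of‑thumb conversions debated above reduce to simple arithmetic: weight memory is roughly parameter count times bits per weight, divided by eight. A minimal sketch (the helper name is made up for illustration; this counts weights only and ignores KV cache, activations, and runtime overhead, which is why real‑world estimates run higher):

```python
def weight_gib(params_billions, bits_per_weight):
    """Approximate weight-only memory footprint in GiB:
    params * (bits / 8) bytes, converted to GiB (2**30 bytes)."""
    return params_billions * 1e9 * bits_per_weight / 8 / 2**30

# Qwen3-Next's 80B total weights, weights only:
#   16-bit: ~149 GiB    8-bit: ~74.5 GiB    4-bit: ~37.3 GiB
# Only ~3B params are *active* per token, which keeps compute low,
# but every expert's weights still has to sit somewhere (VRAM or RAM),
# hence the debate over CPU offload and its bandwidth penalty.
```

The ~37 GiB 4‑bit figure explains why the thread converges on mixed VRAM+RAM setups: the full expert set rarely fits on a single consumer GPU, but system RAM can absorb the overflow at a speed cost.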

Context Length and Long-Context Behavior

  • Qwen3‑Next advertises 262k native context and up to 1M with RoPE scaling (YaRN), but Qwen’s hosted chat currently exposes only 262k, so some stick to earlier 1M‑context models.
  • Several argue that nominal context length ≠ reliable retrieval: many frontier models degrade badly when context is saturated, though others report good multi‑hundred‑kilotoken workflows (e.g., entire repos as XML).
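The RoPE‑scaling mechanism behind the 262k→1M extension can be illustrated with plain position interpolation, the baseline that YaRN refines (YaRN adds a per‑frequency interpolation ramp and an attention‑temperature adjustment on top). A simplified sketch, with a hypothetical function name; this is the PI baseline, not YaRN itself:

```python
def rope_inv_freqs(head_dim, base=10000.0, scale=1.0):
    """Inverse frequencies for RoPE rotary embeddings.
    Plain position interpolation: dividing all frequencies by `scale`
    stretches the trained position range by that factor, so a model
    trained to N tokens can address roughly N * scale positions."""
    return [1.0 / (base ** (2 * i / head_dim)) / scale
            for i in range(head_dim // 2)]

# Stretching 262,144 native tokens toward 1M needs
# scale ≈ 1_000_000 / 262_144 ≈ 3.8; compressing positions this way
# trades some fine-grained positional resolution for reach, which is
# one reason nominal length and reliable retrieval can diverge.
```

The trade‑off in the last comment is the technical core of the thread’s skepticism: a longer addressable window does not guarantee the model attends accurately across all of it.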

Benchmarks, Comparisons, and Skepticism

  • The blog claims Qwen3‑Next‑80B matches the larger 235B MoE on many tasks and outperforms it on ultra‑long‑context; some users testing it disagree, finding it clearly weaker than 235B and only on par with GPT‑OSS‑20B on one coding benchmark.
  • Concerns are raised about “benchmaxxing” in 2025; some want to see results on independent closed benchmarks and broad suites before trusting the claims.
  • Others report strong subjective impressions: chat quality close to the 235B model but noticeably faster, and very competitive pricing on some hosting platforms.

MoE vs Dense and Ecosystem Direction

  • Commenters frame Qwen3‑Next as evidence that large sparse MoE is now decisively better than older 70B+ dense models on a speed–quality basis.
  • There is debate over how novel Qwen’s contribution really is, given that state‑of‑the‑art closed models have been MoE for some time; nonetheless, many see Qwen as pushing open‑weights MoE forward more aggressively than previous releases.

Compute Demand and Jevons-Style Arguments

  • Some speculate that 10× efficiency gains could undercut the business case for massive new datacenters and cloud LLM APIs.
  • Others counter with Jevons‑style reasoning: cheaper, faster inference will enable more demanding models, higher reasoning budgets, continuous agents, and pervasive embedding in software, driving more total compute, not less.
  • There’s disagreement on current AI penetration in domains like customer support and software engineering, but broad consensus that much potential demand remains untapped.

Miscellaneous Notes

  • Newcomers express confusion over text vs image variants; commenters clarify that Qwen3‑Next is text‑only, separate from Qwen Image models.
  • Some users report “strange hallucinations” and unstable behavior; others praise the model’s long‑context performance and Alibaba’s steady cadence of strong open releases.
  • Minor grumbling about the “Next” naming convention and broken content loading on the Qwen website.