Qwen3-Next
Architecture, Linear Attention, and MTP
- Discussion highlights Qwen3‑Next’s hybrid architecture (Gated DeltaNet linear‑attention layers interleaved with gated standard attention, plus a highly sparse MoE) as a genuine departure from “standard” transformer stacks.
- Multi‑Token Prediction (MTP) is seen as a key innovation: predicts multiple future tokens with a shared head, avoiding huge extra unembedding matrices.
- Several comments unpack how MTP enables self‑speculative decoding: generate token n with the full model, draft tokens n+1…n+k cheaply with the MTP head, then validate the drafts; if the guesses are right, you effectively batch ahead “for free” (see the sketch after this list).
- Some confusion around speculative decoding mechanics is resolved: “checking” still costs a forward pass, but batching and reuse across turns make it worthwhile. MTP itself mainly helps inference, not pretraining.
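A minimal toy sketch of the draft‑then‑verify loop described above, assuming a hypothetical `full_model`/`mtp_head` pair rather than any real Qwen3‑Next API; it only illustrates why one batched verification pass can accept several drafted tokens at once.

```python
# Purely illustrative sketch of self-speculative decoding with an MTP draft head.
# `full_model` and `mtp_head` are hypothetical stand-ins (a deterministic toy
# "model"), not the real Qwen3-Next APIs; the point is the control flow.
import random

random.seed(0)
VOCAB_SIZE = 100

def full_model(tokens):
    """Stand-in for the full model: deterministic 'next token' for a prefix."""
    return (sum(tokens) * 31 + len(tokens)) % VOCAB_SIZE

def mtp_head(tokens, k):
    """Stand-in for the cheap MTP head: drafts k future tokens, sometimes wrongly."""
    ctx, draft = list(tokens), []
    for _ in range(k):
        guess = full_model(ctx)
        if random.random() < 0.2:          # make ~20% of drafts wrong
            guess = (guess + 1) % VOCAB_SIZE
        draft.append(guess)
        ctx.append(guess)
    return draft

def speculative_step(tokens, k=4):
    """Draft k tokens cheaply, then verify them against the full model and keep
    the longest correct prefix plus one corrected (or bonus) token."""
    draft = mtp_head(tokens, k)
    # In a real serving stack this is ONE batched forward pass over tokens+draft,
    # yielding the full model's prediction at every drafted position at once.
    targets = [full_model(tokens + draft[:i]) for i in range(k)]
    accepted = []
    for guess, target in zip(draft, targets):
        if guess != target:
            accepted.append(target)        # first mismatch: take the model's token, stop
            break
        accepted.append(guess)
    else:
        # every draft accepted: the same verify pass also gives the next token "for free"
        accepted.append(full_model(tokens + draft))
    return accepted

tokens = [1, 2, 3]
for _ in range(5):
    step = speculative_step(tokens)
    print(f"accepted {len(step)} token(s): {step}")
    tokens += step
```

If the draft head is usually right, each verify pass advances several tokens; if it is usually wrong, you pay roughly one wasted draft per step, which is why the acceptance rate determines the real‑world speedup.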
Quality, Steerability, and Overfitting
- One thread claims Qwen models feel overfit and “stubborn”: great at known patterns (standard math/coding tasks) but hard to steer into alternative reasoning modes or code understanding/reversal.
- Compared to top closed models, people report weaker out‑of‑distribution generalization and steerability, with some users also seeing odd, almost “fraying” dialogue and hallucinations.
- ASCII SpongeBob is used as a memorization probe; larger Qwen coder variants often reproduce one specific ASCII art found on the web, suggesting rote recall. Some argue this indicates strong learning; others see it as memorization winning out over generalization.
MoE Efficiency, VRAM, and Local Running
- Enthusiasm around MoE: 80B total parameters with ~3B active per token, often running as fast as or faster than mid‑size dense models.
- Extensive debate on VRAM requirements: rule‑of‑thumb parameter→memory conversions (a back‑of‑the‑envelope sketch follows this list), the impact of 4‑bit quantization, and how much can be offloaded to CPU RAM.
- Disagreement over practical CPU/GPU swapping of experts: some report usable setups with partial offload; others point to massive bandwidth penalties and 5× slower generation when experts run on CPU.
- Users confirm fully offline use is possible; estimates range from ~50–200GB RAM (or mixed VRAM+RAM) for comfortable runs.
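A back‑of‑the‑envelope sketch of the parameter→memory rule of thumb being debated; the bytes‑per‑parameter figures are rough assumptions (quantization formats differ in overhead) and the result covers weights only, not KV cache or runtime buffers.

```python
# Rough weights-only memory estimate; bytes-per-parameter values are approximate
# assumptions and exclude KV cache, activations, and runtime overhead.
BYTES_PER_PARAM = {
    "fp16/bf16": 2.0,
    "8-bit": 1.0,
    "4-bit (incl. scales)": 0.55,   # ~4.4 bits/param once quantization metadata is counted
}

def weight_memory_gb(params_billion: float, fmt: str) -> float:
    """GB needed just to hold the weights of a model with `params_billion` parameters."""
    return params_billion * BYTES_PER_PARAM[fmt]

for fmt in BYTES_PER_PARAM:
    print(f"80B weights @ {fmt:<22} ~{weight_memory_gb(80, fmt):4.0f} GB")

# Only ~3B parameters are active per token, which is why partial CPU offload of
# rarely-hit experts can work at all -- but every offloaded expert access still
# pays the CPU-RAM bandwidth penalty reported in the thread.
```

At 4‑bit this lands around ~45 GB for weights alone, which is consistent with the ~50–200 GB comfort‑range estimates above once context length, KV cache, and offload strategy are factored in.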
Context Length and Long-Context Behavior
- Qwen3‑Next advertises a 262k‑token native context and up to ~1M tokens with RoPE scaling (YaRN; see the config sketch after this list), but Qwen’s hosted chat currently exposes only 262k, so some stick to earlier 1M‑context models.
- Several argue that nominal context length ≠ reliable retrieval: many frontier models degrade badly when context is saturated, though others report good multi‑hundred‑kilotoken workflows (e.g., entire repos as XML).
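A hedged sketch of what the YaRN override mentioned above typically looks like; the field names follow the Hugging Face `rope_scaling` convention and the numbers are assumptions derived from the advertised 262k native / ~1M extended windows, so the exact values should come from the Qwen3‑Next model card.

```python
# Hypothetical rope_scaling override for stretching the native window toward ~1M
# tokens with YaRN; values are assumptions, not taken from the official config.
NATIVE_CONTEXT = 262_144       # advertised native window
TARGET_CONTEXT = 1_000_000     # advertised upper bound with scaling

rope_scaling = {
    "rope_type": "yarn",                                   # "type" on older transformers versions
    "factor": round(TARGET_CONTEXT / NATIVE_CONTEXT, 2),   # ~3.81x stretch
    "original_max_position_embeddings": NATIVE_CONTEXT,
}
print(rope_scaling)   # merge into config.json or pass as a config override when loading
```

Qwen’s model cards for earlier long‑context releases note that static YaRN applies the scaling factor regardless of input length, which can cost some quality on short prompts, so it is usually enabled only when genuinely long inputs are expected.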
Benchmarks, Comparisons, and Skepticism
- The blog claims Qwen3‑Next‑80B matches the larger 235B MoE on many tasks and outperforms it on ultra‑long‑context; some users testing it disagree, finding it clearly weaker than 235B and only roughly on par with GPT‑OSS‑20B on one coding benchmark.
- Concerns are raised about “benchmaxxing” in 2025; some want to see results on independent closed benchmarks and broad suites before trusting the claims.
- Others report strong subjective impressions: chat quality close to the 235B model but noticeably faster, and very competitive pricing on some hosting platforms.
MoE vs Dense and Ecosystem Direction
- Commenters frame Qwen3‑Next as evidence that large sparse MoE is now decisively better than older 70B+ dense models on a speed–quality basis.
- There is debate over how novel Qwen’s contribution really is, given that state‑of‑the‑art closed models have been MoE for some time; nonetheless, many see Qwen as pushing open‑weights MoE forward more aggressively than previous releases.
Compute Demand and Jevons-Style Arguments
- Some speculate that 10× efficiency gains could undercut the business case for massive new datacenters and cloud LLM APIs.
- Others counter with Jevons‑style reasoning: cheaper, faster inference will enable more demanding models, higher reasoning budgets, continuous agents, and pervasive embedding in software, driving more total compute, not less.
- There’s disagreement on current AI penetration in domains like customer support and software engineering, but broad consensus that much potential demand remains untapped.
Miscellaneous Notes
- Newcomers express confusion over text vs image variants; commenters clarify that Qwen3‑Next is text‑only, separate from Qwen Image models.
- Some users report “strange hallucinations” and unstable behavior; others praise the model’s long‑context performance and Alibaba’s steady cadence of strong open releases.
- Minor grumbling about the “Next” naming convention and broken content loading on the Qwen website.