Qwen3.6-27B: Flagship-Level Coding in a 27B Dense Model
Perceived Quality vs Frontier Models
- Many find Qwen3.6-27B surprisingly strong for coding and general tasks, sometimes “close enough” to Sonnet/Opus for practical work.
- Others say the gap to top closed models remains large in real workflows, especially for deep reasoning, ambiguous intent, and large codebases.
- Consensus: It’s impressive for its size and cost, but not a full replacement for frontier closed models yet.
Benchmarks, Gaming, and How to Evaluate
- Several commenters distrust headline benchmarks, arguing they're easy to "benchmaxx" via targeted RL or overfitting.
- Recommended approaches:
  - Use your own tasks and unreleased test sets (a minimal harness sketch follows this list).
  - Look at composite scores (e.g., ArtificialAnalysis) and harder evals like ARC-AGI 2 and SWE-rebench, but treat all of them with caution.
- Some note that coding benchmarks can mask brittleness outside the trained harness/task.
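One concrete way to follow the "use your own tasks" advice is a tiny private harness that replays a held-out task file against any OpenAI-compatible endpoint. A minimal sketch: the endpoint URL, model name, file name, and substring-match scoring below are illustrative assumptions, not details from the thread.

```python
import json
import requests

API_URL = "http://localhost:8080/v1/chat/completions"  # assumption: any OpenAI-compatible server

def ask(prompt: str) -> str:
    """Send one prompt to the local endpoint and return the reply text."""
    resp = requests.post(API_URL, json={
        "model": "qwen3.6-27b",  # whatever name your server exposes (assumption)
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,        # keep outputs as deterministic as possible for scoring
    }, timeout=300)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# tasks.jsonl: one {"prompt": ..., "expected": ...} object per line (your private set).
passed = total = 0
with open("tasks.jsonl") as f:
    for line in f:
        task = json.loads(line)
        total += 1
        # Crude substring check; swap in unit tests or a grader for real evals.
        if task["expected"] in ask(task["prompt"]):
            passed += 1

print(f"{passed}/{total} private tasks passed")
```

Because the tasks never leave your machine, they can't be benchmaxxed, which is the whole point of the thread's advice.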
Hardware, Performance & Quantization
- A full-precision dense 27B needs high-end GPUs or 96–128GB of unified memory; typical consumer setups rely on 4–8 bit quants.
- Real-world decode speeds reported:
  - ~30–50 tok/s on 3090/4090-class GPUs with 4–6 bit quants.
  - ~20–35 tok/s on top-end Macs (M4/M5/M2 Ultra) with MLX or llama.cpp.
  - ~5–10 tok/s on weaker laptop/unified-memory systems; even 1–2 tok/s is considered usable for slow, non-interactive workloads.
- 4-bit quants are widely used; many argue Q4_K_M or similar is "almost lossless" for many tasks, while others say 4-bit clearly hurts long-context and agentic behavior.
- KV-cache precision and size strongly affect usable context; Qwen's linear/efficient attention helps fit long contexts in limited VRAM (see the sizing sketch after this list).
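To ground the memory claims above, here is standard back-of-envelope sizing for weights plus KV cache. The layer and head counts are placeholder assumptions (the thread doesn't give Qwen3.6-27B's exact architecture); the formulas are the usual ones.

```python
# Back-of-envelope memory sizing for a dense 27B model.
# Architecture numbers are ASSUMED placeholders, not official Qwen3.6-27B specs.
PARAMS = 27e9
N_LAYERS = 48      # assumption
N_KV_HEADS = 8     # assumption (grouped-query attention)
HEAD_DIM = 128     # assumption

def weights_gb(bits_per_param: float) -> float:
    """Approximate weight memory at a given quantization width."""
    return PARAMS * bits_per_param / 8 / 1e9

def kv_cache_gb(context_len: int, bytes_per_elem: int) -> float:
    """KV cache = 2 (K and V) * layers * kv_heads * head_dim * context * elem size."""
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * context_len * bytes_per_elem / 1e9

print(f"FP16 weights:   ~{weights_gb(16):.0f} GB")    # ~54 GB -> needs big unified memory
print(f"~4.5-bit quant: ~{weights_gb(4.5):.0f} GB")   # ~15 GB -> fits a 24 GB card
for ctx in (8_192, 131_072):
    print(f"KV cache @ {ctx:>7} tokens: FP16 ~{kv_cache_gb(ctx, 2):.1f} GB, "
          f"Q8 ~{kv_cache_gb(ctx, 1):.1f} GB")
```

The punchline matches the thread: a 4-bit quant of the weights fits a 24GB card, but a long full-precision KV cache can rival or dwarf the weights, which is why cache precision and attention efficiency dominate usable context.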
Dense vs MoE and Model Choice
- Qwen3.6-27B is dense; Qwen3.6-35B-A3B is MoE with ~3B active params per token.
- Dense 27B is slower but often "smarter" per token; MoE 35B is much faster and better suited to bandwidth-limited Macs/Strix Halo (a back-of-envelope bandwidth estimate follows this list).
- Several prefer MoE models (Qwen, Nemotron, Kimi) for very long context and throughput; dense models for peak quality in narrow tasks.
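The dense-vs-MoE speed gap is mostly a memory-bandwidth story: each decoded token has to stream roughly all active weights through memory once, so tok/s is bounded by bandwidth divided by active-weight bytes. A rough sketch; the bandwidth figures are ballpark assumptions for illustration, not measurements from the thread.

```python
# Upper bound on decode speed: tok/s <= memory bandwidth / bytes of active weights.
# Bandwidth figures below are ballpark ASSUMPTIONS for illustration only.
def max_tok_per_s(active_params: float, bits: float, bandwidth_gb_s: float) -> float:
    return bandwidth_gb_s * 1e9 / (active_params * bits / 8)

HARDWARE = {
    "RTX 4090 (~1000 GB/s)": 1000,
    "M2 Ultra (~800 GB/s)":   800,
    "Strix Halo (~256 GB/s)": 256,
}

for name, bw in HARDWARE.items():
    dense = max_tok_per_s(27e9, 4.5, bw)  # dense 27B at ~4.5-bit quant
    moe   = max_tok_per_s(3e9, 4.5, bw)   # MoE with ~3B active params per token
    print(f"{name}: dense 27B ceiling ~{dense:.0f} tok/s, MoE 3B-active ~{moe:.0f} tok/s")
```

These ceilings line up with the real-world numbers reported above and show why MoE shines on bandwidth-limited machines: roughly 9x fewer bytes per token.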
Local vs Cloud, Cost & Trust
- Strong interest in replacing or reducing reliance on Claude/GPT for coding, especially given rate limits, price, and provider “lobotomization” or silent quantization changes.
- Some teams already serve Qwen3.5/3.6 to internal devs on 24GB–32GB GPUs at 20–40 tok/s, citing cost control and data security.
- Others argue the marginal quality of frontier models justifies the higher cost when developer time is expensive (rough break-even arithmetic follows this list).
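The developer-time argument is easy to make concrete with a toy break-even calculation. Every number below is a hypothetical illustration, not a figure from the thread:

```python
# Toy break-even calculation; all numbers are HYPOTHETICAL illustrations.
dev_cost_per_hour = 100.0          # fully loaded developer cost (assumption)
frontier_extra_per_month = 150.0   # extra spend on frontier vs. local (assumption)
work_days_per_month = 21

# Minutes per day the better model must save to pay for itself:
break_even_min = frontier_extra_per_month / work_days_per_month / dev_cost_per_hour * 60
print(f"Break-even: ~{break_even_min:.1f} minutes of saved time per day")  # ~4.3 min
```

At these (made-up) rates the frontier model only has to save a few minutes a day, which is why "marginal quality" can still win on total cost.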
Tooling, Ecosystem & New User Friction
- Choosing quants, context sizes, and flags is seen as confusing; many describe early adopters as doing "unpaid QA" in the first weeks after a release.
- Tools like Unsloth Studio, LM Studio, oMLX, llama.cpp, and OpenCode are frequently used; they auto-pick quants/params or expose OpenAI-compatible APIs (a client sketch follows this list).
- Advice: wait 1–2 weeks post-release for quant bugs and inference issues to settle.
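Because most of these tools speak the same wire protocol, moving a workflow from a hosted provider to a local model is often just a base-URL change in the official openai Python SDK. The endpoint and model name below are assumptions about a typical local setup:

```python
from openai import OpenAI

# Point the standard OpenAI client at a local OpenAI-compatible server
# (llama.cpp's server, LM Studio, vLLM, ...). URL and model name are assumptions.
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed-locally",  # most local servers ignore the key
)

reply = client.chat.completions.create(
    model="qwen3.6-27b",  # whatever name your local server exposes (assumption)
    messages=[{"role": "user", "content": "Refactor this function to be iterative: ..."}],
)
print(reply.choices[0].message.content)
```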
SVG “Pelican” Test & Overfitting Concerns
- The model produces an extremely good "pelican on a bicycle" SVG; some see this as evidence of strong spatial/reasoning ability, while others suspect the test is now in the training data.
- The test is increasingly viewed as Goodharted: a great pelican doesn't guarantee broad capability, and may just reflect targeted RL or training exposure.