Qwen3-VL
Benchmarks, Claims, and Positioning
- Release praised for unusually extensive benchmarking; some appreciate the lack of obvious cherry-picking, while others argue many of the benchmarks are saturated or contaminated and should be retired.
- Several commenters accept that Qwen3‑VL may be SOTA among multimodal models, proprietary ones included, though others say it’s only marginally better than existing closed models.
- Desire for comparisons with other strong open models (e.g., GLM) and criticism of specific benchmarks like OSWorld as “deeply flawed.”
- One commenter notes little apparent architectural novelty (vision encoder + projector + autoregressive LLM), while another points to prior Qwen work like DeepStack as genuine innovation.
Multimodal Capabilities: Impressive and Fragile
- Strong real‑world reports: it handles low‑quality, messy invoice images better than custom CV+OCR pipelines (OpenCV, Tesseract, GPT‑4o) and can output bounding boxes to improve OCR (see the sketch after this list).
- Video demo (identifying goal timing, scorer, and method in a ~100‑minute match) impresses many.
- Others note limits: it still struggles with edge cases like animals photoshopped with extra limbs, dice faces (e.g., a D20), and other rare patterns, and tends to “correct” images toward typical anatomy even when told they’re edited.
- General sentiment: excellent practical VLM, but far from robust general vision understanding; still highly dependent on what’s well represented in its training data.
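For readers wanting to try the invoice workflow described above, a minimal sketch follows. It assumes Qwen3‑VL is served behind an OpenAI‑compatible endpoint (e.g., via vLLM or a hosted API) and that the model is prompted to return boxes as pixel‑coordinate JSON; the endpoint URL, model id, prompt, and response format are assumptions, not details confirmed in the thread.

```python
import base64
import json

from openai import OpenAI
from PIL import Image

# Hypothetical OpenAI-compatible endpoint; adjust base_url/model for your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")


def image_to_data_url(path: str) -> str:
    """Encode a local image as a base64 data URL for the chat API."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return f"data:image/jpeg;base64,{b64}"


def locate_fields(path: str) -> list[dict]:
    """Ask the VLM for bounding boxes of invoice fields; assumes a clean JSON reply."""
    resp = client.chat.completions.create(
        model="Qwen3-VL-235B-A22B-Instruct",  # assumed model id
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_to_data_url(path)}},
                {"type": "text", "text": (
                    "Locate the invoice number, date, and total. Reply with JSON only: "
                    '[{"label": "...", "box": [x1, y1, x2, y2]}] in pixel coordinates.'
                )},
            ],
        }],
    )
    return json.loads(resp.choices[0].message.content)


# Crop each detected region so a downstream OCR pass works on a clean snippet.
image = Image.open("invoice.jpg")
for field in locate_fields("invoice.jpg"):
    crop = image.crop(tuple(field["box"]))
    crop.save(f"{field['label']}.png")
```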
Open Source Leadership, China, and Geopolitics
- Several see Qwen (and DeepSeek before it) as proof that open models are no longer “catching up” but actually leading in many areas.
- Strong appreciation for releasing such a large multimodal model as open weights, with some users already swapping it in for GPT‑4.1‑mini or similar in production agents at significantly lower token costs.
- Extensive debate about Chinese strategy:
  - Motives suggested include undercutting US AI incumbents, commoditizing models to sell hardware, ensuring strong Chinese‑language performance, talent competition, narrative control, and soft power.
  - Others argue Chinese labs have effectively “blank checks” via state priorities, with expectations of serving social control rather than profit.
  - Pushback against treating “the Chinese” as a single agent; some call that orientalist and say credit should go to specific teams, not a whole country.
- Security concerns raised about sending data to Chinese‑hosted chat frontends, even if weights are open and can be run locally.
Model Zoo, Naming, and Product Confusion
- Confusion around Qwen’s lineup is a recurring complaint:
  - Qwen3‑VL‑235B‑A22B‑Instruct vs Qwen3‑VL‑Plus vs qwen‑plus‑2025‑09‑11 vs various “Omni” and “Next”/“Thinking” variants.
  - “Plus” is generally understood to mean closed‑weight API models as opposed to open‑weight downloadable ones, but users say it’s still unclear which API model is “better” for a given use case.
- Commenters note that opaque, marketing‑heavy model naming is widespread across AI vendors, though some think DeepSeek/Claude are clearer.
Developer Experience and Use Cases
- Users report:
  - Using the “Thinking” variants successfully for workflow automation and replacing GPT‑4.1‑mini in agentic systems with similar quality at lower cost.
  - Using the Qwen multimodal models for image captioning, meal/user photo tagging, and complex document understanding.
- Tools recommended for newcomers: LM Studio and AnythingLLM for easy local use; Qwen’s own chat site for quick tests (with security caveats). A minimal local‑endpoint example follows this list.
- Some find smaller, older Qwen variants (e.g., QwQ or Qwen2.5‑VL 7B) still preferable for specific tasks once fine‑tuned.
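For the local route mentioned above, LM Studio exposes an OpenAI‑compatible server (by default at http://localhost:1234/v1), so the production swaps commenters describe largely amount to changing a base URL and model name. The model id below is a placeholder for whatever quantized build is actually loaded.

```python
from openai import OpenAI

# Point the standard OpenAI client at LM Studio's local server instead of a
# hosted API; only the base_url, api_key, and model name change.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

resp = client.chat.completions.create(
    model="qwen3-vl",  # placeholder id: use whatever build LM Studio has loaded
    messages=[{"role": "user", "content": "Summarize this invoice layout in one sentence."}],
)
print(resp.choices[0].message.content)
```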
Cost, Pricing, and Efficiency
- Qwen3‑VL API pricing is reported as substantially cheaper than top proprietary models: roughly 1/10 of one leading model and 1/3–1/2 of another on a per‑token basis, depending on the source quoted (see the illustrative comparison after this list).
- Users highlight big practical savings when swapping into existing workflows, with no obvious quality drop in their domains.
- Broader discussion about commoditization: some argue widespread high‑quality open models will pop the US AI stock bubble; others respond that value will just move up‑stack rather than disappear.
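A quick illustration of the kind of savings being described; the per‑million‑token prices below are placeholders chosen only to match the rough ratios quoted in the thread, not actual list prices.

```python
# Hypothetical $/1M-input-token prices matching the quoted ratios
# (~1/10 of one proprietary model, ~1/3-1/2 of another); not real list prices.
PRICES = {
    "proprietary-a": 2.50,
    "proprietary-b": 0.70,
    "qwen3-vl-api":  0.25,
}

monthly_tokens_m = 500  # an arbitrary workload: 500M input tokens/month

for model, price in PRICES.items():
    print(f"{model:>14}: ${price * monthly_tokens_m:,.2f}/month")
# At these placeholder rates: $1,250.00 and $350.00 vs $125.00 for qwen3-vl-api.
```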
Running Large Models Locally
- Many are excited by the 235B open weights but question the feasibility of self‑hosting:
  - FP16 weights alone imply roughly 470–512GB of RAM; even quantized (e.g., ~235GB at q8), consumer GPUs fall far short without multiple very expensive cards (see the back‑of‑envelope sketch after this list).
  - 8× 32GB GPUs or datacenter cards (H200‑class) are considered out of reach for small players, and multi‑node setups without NVLink suffer massive performance hits.
- Suggested “borderline feasible” local setups:
  - High‑RAM unified‑memory systems (e.g., a 128GB GMKtec EVO‑X2 or a 96GB+ Strix Halo / Framework Desktop) for smaller or MoE models, accepting modest tokens/s.
  - High‑bandwidth GPUs (e.g., 96GB workstation cards), or very wide‑channel DDR5 Threadripper‑class machines for CPU‑based inference.
- Several warn that even expensive high‑RAM Macs or desktops will feel like “having a pen pal, not an assistant” for ≥70B dense models; MoE models fare better.
- Some argue that for most users, cloud inference remains more economical than spending ~$10k+ on fast local hardware.
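A back‑of‑envelope sketch of where these numbers come from: weight memory is parameter count times bytes per weight, and a crude tokens/s ceiling for bandwidth‑bound decoding is memory bandwidth divided by bytes read per token, which is where the A22B MoE design (only ~22B active parameters per token) helps. The bandwidth figures below are illustrative assumptions for the hardware classes discussed, not measurements.

```python
# Rough weight-memory and decode-throughput estimates for Qwen3-VL-235B-A22B.
TOTAL_PARAMS = 235e9   # all weights must fit in (V)RAM
ACTIVE_PARAMS = 22e9   # MoE: only ~22B params are read per generated token

BYTES_PER_PARAM = {"fp16": 2.0, "q8": 1.0, "q4": 0.5}

for quant, bpp in BYTES_PER_PARAM.items():
    print(f"{quant}: ~{TOTAL_PARAMS * bpp / 1e9:.0f} GB of weights")
# fp16: ~470 GB, q8: ~235 GB, q4: ~118 GB (plus KV cache and runtime overhead)

# Crude decode ceiling: bandwidth / bytes read per token. Bandwidth values
# are assumed, illustrative figures for the hardware classes in the thread.
BANDWIDTH_GBPS = {
    "unified-memory mini PC (~256 GB/s)": 256,
    "high-end workstation GPU (~1800 GB/s)": 1800,
}
for name, gbps in BANDWIDTH_GBPS.items():
    tps = gbps / (ACTIVE_PARAMS * BYTES_PER_PARAM["q4"] / 1e9)
    print(f"{name}: ~{tps:.0f} tok/s upper bound at q4")
```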
Limitations, Skepticism, and Open Questions
- Recurring skepticism about benchmark saturation and overuse, the fragility of vision understanding on rare patterns, and the lack of clear architectural breakthroughs.
- Questions remain about:
  - How Qwen3‑VL compares head‑to‑head with other new multimodal leaders (e.g., Omni models).
  - Whether smaller, more practical Qwen3‑VL variants will be released.
  - How to meaningfully evaluate vision‑language models beyond saturated leaderboards and hand‑picked demos.