Qwen3.5: Towards Native Multimodal Agents

Quantization, MoE, and Local Inference

  • Discussion centers on whether 2–3 bit quantizations of very large models outperform smaller dense models run at 8–16 bit.
  • Consensus: 4-bit (e.g., MXFP4) is usually the “sweet spot”; 2–3 bit often degrades quality but can remain usable for very large MoE models.
  • For MoE (e.g., 397B with ~17B active), inactive experts can be mmap’d from disk and KV cache offloaded to swap; performance then depends heavily on spare RAM and storage speed. No clear benchmarks; outcomes are workload-specific.
  • Some argue you must eval on your own tasks; many decisions are currently driven by “vibes” rather than rigorous calibration.
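The mmap-plus-swap argument above can be made concrete with back-of-envelope arithmetic (parameter counts are from the discussion; the bits-per-weight figure is an approximation for a 4-bit format including quantization overhead):

```python
def quant_gib(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GiB for a given quantization."""
    return params_b * 1e9 * bits_per_weight / 8 / 2**30

total = quant_gib(397, 4.25)   # full 397B MoE at ~4.25 bits/weight
active = quant_gib(17, 4.25)   # ~17B active parameters per token

print(f"full weights on disk: ~{total:.0f} GiB")
print(f"hot working set (active experts): ~{active:.0f} GiB")
```

Under these rough assumptions only ~8 GiB of active experts must be resident per token, while the remaining ~190 GiB can stay mmap'd on disk, which is why spare RAM and storage speed end up dominating throughput.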

Context Length and Qwen3.5-Plus

  • Hosted Qwen3.5-Plus reportedly supports 1M tokens vs 200–262k “native” in open weights.
  • Commenters note the extended context relies on YaRN-style scaling, with caveats: it can hurt short-context performance and may be best enabled only for long inputs.
  • OpenRouter exposes both base and Plus; Plus is cheaper under some context limits, implying proprietary inference optimizations.

RL Environments and Training Strategy

  • Qwen claims 15k RL environments; commenters infer this could include CLIs, GUIs, APIs, GitHub repos, games—anything with cheap, automatable feedback.
  • A speculative pipeline: mine GitHub, auto-classify repos as environments, auto-generate goals (e.g., introduce/fix bugs), then run large-scale RL.
  • View: each generation of models improves this pipeline, creating a “throw money at it” scaling regime for verifiable tasks; judgment-heavy tasks remain harder and risk LLM-judge bias.
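The speculated pipeline could be sketched as follows; every name here is hypothetical, since the thread is inferring, not describing, Qwen's actual setup:

```python
from dataclasses import dataclass

@dataclass
class Environment:
    repo: str
    kind: str          # "cli", "gui", "api", "game", ...
    goal: str          # auto-generated task, e.g. fix an injected bug

def make_bug_fix_env(repo: str) -> Environment:
    """Inject a bug, then ask the agent to restore passing tests.
    The test suite is the cheap, automatable reward signal."""
    return Environment(repo=repo, kind="cli",
                       goal=f"make the test suite in {repo} pass again")

# Mined repos would be auto-classified and turned into environments at scale.
envs = [make_bug_fix_env(r) for r in ["org/parser", "org/httplib"]]

# reward(env, trajectory) -> 1.0 if tests pass, else 0.0: verifiable feedback,
# unlike judgment-heavy tasks that need a (biasable) LLM judge.
```

The point of the sketch is the scaling property the commenters describe: once repo mining and goal generation are automated, adding more environments is mostly a matter of compute.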

Benchmarks, Benchmaxxing, and ARC-AGI

  • Many praise Qwen’s capabilities and fast iteration but repeatedly raise concerns about “benchmaxxing” and overfitting to public benchmarks.
  • ARC-AGI is cited as a counter-signal: open models (and even some proprietary ones) score poorly there despite strong mainstream benchmarks. Some argue ARC-AGI doesn’t map well to typical user needs.
  • Skeptics report that models advertised as “Sonnet 4.5-level” often collapse on real, complex work—especially once quantized for consumer hardware.

Hardware and Practical "Openness"

  • Debate over whether these “open” models are effectively cloud-only: 397B is beyond most local setups, but 80–120B-ish models plus aggressive quantization may run on 128–256GB Macs or Strix Halo APUs.
  • Strong disagreement over the usefulness of Apple silicon for serious LLM work: token generation can be fine, but prefill is often criticized as too slow for agentic workflows.
  • Some want smaller Qwen3.5 distills (80–110B, with vision) for 128GB devices; maintainers hint more sizes are coming.
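The prefill-versus-decode complaint has a simple first-order explanation: decode is bound by memory bandwidth (the active weights are read once per token), while prefill is bound by compute. A rough model, with ballpark hardware numbers that are assumptions rather than measurements:

```python
BANDWIDTH_GBPS = 800        # e.g. high-end M-series unified memory, approximate
COMPUTE_TFLOPS = 30         # sustained GPU throughput, very approximate

def decode_tok_per_s(active_params_b: float, bits: float) -> float:
    """Decode reads the active weights once per token: bandwidth-bound."""
    bytes_per_tok = active_params_b * 1e9 * bits / 8
    return BANDWIDTH_GBPS * 1e9 / bytes_per_tok

def prefill_tok_per_s(active_params_b: float) -> float:
    """Prefill needs ~2 * params FLOPs per token: compute-bound."""
    flops_per_tok = 2 * active_params_b * 1e9
    return COMPUTE_TFLOPS * 1e12 / flops_per_tok
```

With these illustrative numbers, a ~17B-active model at 4-bit decodes at a comfortable ~90 tok/s, but even ~900 tok/s of prefill means a 200k-token agentic context sits for several minutes before the first output token, which is exactly the agentic-workflow complaint.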

Evaluation Oddities: Pelicans, Car Wash, and “Native Agents”

  • The “pelican on a bike” SVG test resurfaces as a folk benchmark for multimodal precision and hallucination; models now mostly produce bad-but-amusing SVGs, possibly due to training on earlier poor outputs.
  • Another meme test: “car wash 50–100m away—walk or drive?” Some models still misinterpret the question; others now handle it well.
  • Several commenters argue that beyond benchmarks, the real differentiator is whether “native multimodal agents” can maintain coherent multi-step tool use and long-horizon context without losing the thread.

Ecosystem, UX, and Miscellaneous

  • People note Qwen3.5 is already on OpenRouter with competitive pricing but no prompt caching yet.
  • Requests for third-party SWE-bench-verified results; vendor self-reporting is treated with caution.
  • Multiple complaints about Qwen’s blog UX: dark-mode rendering issues, heavy PNG tables, auto-downloaded PDFs, and Safari privacy settings blocking content.