Qwen3-Omni-Flash-2025-12-01: a next-generation native multimodal large model
Hallucinations and uncertainty
- A user test (resistor count in a specific guitar pedal) showed a confident but wrong answer, highlighting persistent hallucinations.
- Several comments argue that models don’t need to know obscure trivia, but they must know when they don’t know.
- There’s interest in a “cautiousness” control (like a slider from “only answer if very certain” to “feel free to guess”), but skepticism that mainstream chat products will do this because users tend to prefer confident answers.
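No mainstream chat API exposes such a control today; the minimal sketch below shows one way a "cautiousness" slider could be approximated purely through the system prompt. The function name, thresholds, and wording are all hypothetical, not anything Qwen or others ship.

```python
# Hypothetical: map a 0.0-1.0 "cautiousness" slider onto system-prompt
# instructions. Nothing here is a real API parameter; it only shapes the
# prompt that accompanies every request.

def cautiousness_prompt(level: float) -> str:
    """Return a system-prompt fragment for a cautiousness level in [0, 1]."""
    level = max(0.0, min(1.0, level))
    if level >= 0.8:
        rule = ("Only state facts you are highly confident in. If unsure, "
                "say 'I don't know' instead of guessing.")
    elif level >= 0.4:
        rule = ("Answer normally, but flag any claim you are unsure about "
                "with an explicit caveat.")
    else:
        rule = "Feel free to speculate, but label speculation as such."
    return f"Cautiousness level {level:.1f}. {rule}"

print(cautiousness_prompt(0.9))
```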
Trivia as evaluation
- Some see the resistor question as useless trivia; others say trivia is a valid probe of hallucination behavior (a minimal abstention check is sketched after this list).
- It’s noted that training capacity is limited and must prioritize useful, composable knowledge rather than arbitrary specifics.
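The point of a trivia probe is not whether the answer is right but whether the model admits uncertainty. A toy sketch of that check follows; the hedge keywords and the classification scheme are illustrative only, not a rigorous metric.

```python
# Toy hallucination probe: does the model hedge on obscure trivia, or does it
# commit to a specific (likely fabricated) number? The classifier is a crude
# keyword heuristic.
import re

HEDGES = ("i don't know", "not sure", "can't verify", "uncertain",
          "i'm unable", "no reliable")

def classify_reply(reply: str) -> str:
    text = reply.lower()
    if any(h in text for h in HEDGES):
        return "abstained"
    if re.search(r"\b\d+\b", text):        # committed to a specific number
        return "confident_specific"
    return "vague"

# A confidently specific answer to an unverifiable question counts against the model.
print(classify_reply("The pedal uses exactly 23 resistors."))   # confident_specific
print(classify_reply("I can't verify the exact count."))        # abstained
```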
Real-time speech-to-speech and local hosting
- Qwen3-Omni-Flash appears to support native, real-time speech-to-speech rather than a cascaded STT → LLM → TTS pipeline (the cascaded baseline is sketched after this list).
- Running it locally is currently hard: major inference frameworks lack full support, especially on non-Nvidia hardware.
- A few experimental deployments exist (e.g., vLLM-based, custom “Talker” support), but they’re early and sometimes fail subtle audio tests (e.g., distinguishing heteronyms like “record” noun vs verb).
- Local voice-chat UX is described as immature; building robust, natural-language-driven workflows is seen as a big emerging area.
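For contrast with native speech-to-speech, here is a rough sketch of the cascaded baseline the thread mentions: Whisper for STT and any OpenAI-compatible local endpoint for the LLM. The endpoint URL, model name, and the stubbed TTS step are assumptions; a real deployment would also need streaming, voice-activity detection, and interruption handling.

```python
# Cascaded baseline: STT -> LLM -> TTS, as opposed to a native
# speech-to-speech model. The endpoint URL and model name are placeholders
# for whatever local server you run; TTS is left as a stub.
import whisper                 # pip install openai-whisper
from openai import OpenAI      # pip install openai

stt = whisper.load_model("base")
llm = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def cascaded_turn(wav_path: str) -> str:
    text_in = stt.transcribe(wav_path)["text"]           # 1. speech -> text
    reply = llm.chat.completions.create(                 # 2. text -> text
        model="local-model",                              # placeholder name
        messages=[{"role": "user", "content": text_in}],
    ).choices[0].message.content
    # 3. text -> speech would go here (any local TTS). Note that intonation,
    #    stress, and timing from the user's audio are already lost by step 2.
    return reply

print(cascaded_turn("question.wav"))
```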
Voice quality and “AI accent”
- Several people sense a “lifeless” quality in the demo voice: flat intonation, overly stable cadence.
- Some prefer this neutral style, disliking ChatGPT-style “overly excited” Americanized voices, especially for European use cases.
- There’s debate over whether the system is truly end-to-end audio or relies on an intermediate TTS layer; probing behavior on accents, singing, and heteronyms is suggested as a way to tell (one such probe is sketched below).
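One concrete version of the heteronym test: play the model two clips of "record" (noun vs. verb stress) and ask which it heard. The sketch assumes an OpenAI-compatible endpoint that accepts audio input via the `input_audio` content format; whether a given local server actually supports that is exactly the open question in the thread, and the file names and model name are placeholders.

```python
# Heteronym probe: can the model distinguish RE-cord (noun) from re-CORD
# (verb) from audio alone? A cascaded STT -> LLM pipeline cannot, because
# the transcript of both clips is identical.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def ask_about_clip(wav_path: str) -> str:
    audio_b64 = base64.b64encode(open(wav_path, "rb").read()).decode()
    resp = client.chat.completions.create(
        model="local-omni-model",   # placeholder
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "In this clip, is 'record' used as a noun or a verb? "
                         "Answer based on the pronunciation you hear."},
                {"type": "input_audio",
                 "input_audio": {"data": audio_b64, "format": "wav"}},
            ],
        }],
    )
    return resp.choices[0].message.content

for clip in ("record_noun.wav", "record_verb.wav"):
    print(clip, "->", ask_about_clip(clip))
```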
Model size, architecture, and benchmarks
- One description: a stacked system with separate audio and vision encoders, a ~30B MoE language backbone (~3B active parameters), an audio LLM, and an audio-token decoder; rough sizing arithmetic follows this list.
- Benchmarks show “Flash” beating much larger models (e.g., Qwen3-235B), prompting suspicion that it might be heavily trained on benchmark-adjacent data.
- Multiple commenters warn that public benchmarks are unreliable for choosing models; private task-specific evaluation is recommended.
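Back-of-the-envelope arithmetic for the MoE numbers quoted above, taking "~30B total / ~3B active" at face value and ignoring the encoders and audio decoder, so treat these as rough lower bounds.

```python
# Rough sizing for a ~30B-total / ~3B-active MoE language backbone.
# Memory is driven by *total* parameters (all experts must be resident),
# while per-token compute scales with *active* parameters.
TOTAL_PARAMS = 30e9
ACTIVE_PARAMS = 3e9

BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}

for dtype, bpp in BYTES_PER_PARAM.items():
    gib = TOTAL_PARAMS * bpp / 2**30
    print(f"{dtype}: ~{gib:.0f} GiB of weights (plus KV cache and encoders)")

# Per-token forward compute: roughly 2 FLOPs per active parameter.
print(f"~{2 * ACTIVE_PARAMS / 1e9:.0f} GFLOPs per generated token (order of magnitude)")
```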
Open weights vs “Flash” and API-only confusion
- The blog post links to a Hugging Face collection, but it contains only the older Qwen3-Omni models; the new “Flash-2025-12-01” weights do not appear to be available.
- Clarifications in-thread: “Flash” variants are closed-weight, higher-performing updates used on Qwen’s own chat, distinct from the older open-weight Omni-30B-A3B.
- Several users find Qwen’s messaging around openness vs API-only offerings confusing, feeling misled into chasing non-existent downloads.
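A quick way to check what is actually published, rather than chasing links, is to list Qwen's Omni repos directly via `huggingface_hub`; API-only "Flash" variants simply won't appear. The search string below is just an example filter.

```python
# List publicly visible Qwen *Omni* model repos on Hugging Face to see which
# checkpoints exist as open weights.
from huggingface_hub import list_models   # pip install huggingface_hub

for m in list_models(author="Qwen", search="omni"):
    print(m.id)
```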
Tooling, platforms, and deployment questions
- Mac users ask about GGUF/MLX-style local Omni with streaming mic/webcam; current suggestions (vLLM, Whisper, etc.) don’t fully satisfy the multimodal, real-time requirement.
- Splitting internal “thinking” tokens from user-facing audio output in real time is identified as an unresolved design issue for native audio-token models.
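The thinking-vs-audio split can be framed as a stream-routing problem. The toy demultiplexer below is purely illustrative: the token tags and the idea of an interleaved text/audio stream are assumptions, not how any specific model labels its output.

```python
# Toy demultiplexer for an interleaved output stream: route "thinking" text
# tokens to a log and audio codec tokens to the playback buffer. Real native
# audio-token models do not necessarily tag tokens this way; this only
# illustrates the design problem.
from dataclasses import dataclass

@dataclass
class Token:
    kind: str      # "think", "text", or "audio" (hypothetical tagging)
    payload: str

def demux(stream):
    transcript, audio_frames, reasoning_log = [], [], []
    for tok in stream:
        if tok.kind == "think":
            reasoning_log.append(tok.payload)      # never reaches the user
        elif tok.kind == "audio":
            audio_frames.append(tok.payload)       # feed to the codec decoder
        else:
            transcript.append(tok.payload)         # optional on-screen text
    return transcript, audio_frames, reasoning_log

example = [Token("think", "user asked about pedals"),
           Token("audio", "<codec:0137>"), Token("text", "Sure,")]
print(demux(example))
```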