Qwen3-Omni: Native Omni AI model for text, image and video
Multimodal architecture & capabilities
- Commenters are intrigued by the "thinker/talker" split and the shared embedding space for text, image, audio, and video, likening it to how human concepts are not forced through text (a rough sketch of the idea follows this list).
- Some argue that all transformer-based LLMs ultimately operate in a latent "state space" before next-token prediction, while others note that video/audio pipelines can be more complex (an LLM plus separate feature extractors, etc.).
- Native audio–audio translation and general audio understanding (e.g., recognizing instruments) are seen as standout features compared to other multimodal models, where audio is less mature.
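
A minimal sketch of the thinker/talker idea under discussion, assuming per-modality encoders that project into one shared embedding space, a "thinker" transformer that reasons over the fused sequence, and a "talker" head that decodes speech tokens. Module names, dimensions, and structure here are illustrative, not Qwen's actual implementation:

```python
# Hypothetical thinker/talker sketch: modality encoders -> shared space ->
# "thinker" transformer -> "talker" speech-token head. Illustrative only.
import torch
import torch.nn as nn

D = 256  # shared embedding width (illustrative)

class OmniSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Stand-ins for real text/audio/vision encoders.
        self.text_proj = nn.Linear(128, D)
        self.audio_proj = nn.Linear(80, D)    # e.g. mel-spectrogram frames
        self.vision_proj = nn.Linear(768, D)  # e.g. image-patch features
        layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
        self.thinker = nn.TransformerEncoder(layer, num_layers=2)
        self.talker = nn.Linear(D, 1024)      # head over a speech-codec vocabulary

    def forward(self, text, audio, frames):
        # Fuse all modalities into one sequence in the shared space,
        # so the "thinker" attends across them without going through text.
        fused = torch.cat([self.text_proj(text),
                           self.audio_proj(audio),
                           self.vision_proj(frames)], dim=1)
        return self.talker(self.thinker(fused))  # logits over speech tokens

model = OmniSketch()
out = model(torch.randn(1, 8, 128), torch.randn(1, 20, 80), torch.randn(1, 4, 768))
print(out.shape)  # torch.Size([1, 32, 1024])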
Demos, UX & voice experience
- The official demo video, especially real-time speech translation with speech output, impressed many as one of the best public demos so far.
- The web chat (chat.qwen.ai) offers many distinct voices; people found them entertaining, especially when used outside their native language (e.g., producing heavily accented Russian).
- Some users found the English voices paced too slowly and the Spanish ones too fast; one user's trip-planning session stalled and then started replying in Chinese.
- There is confusion over how to mix text input and spoken output in the UI; voice mode is accessed via a separate audio icon.
Model variants, open weights & “Flash” models
- Open weights: Qwen3-Omni-30B-A3B (~70 GB in BF16) is praised for being large yet still locally runnable after quantization (e.g., Q4 on 24 GB GPUs; see the rough math after this list). It is too big for smooth use on 16 GB unified-memory Macs, where SSD thrashing is expected.
- No mature macOS multimodal inference stack yet; audio/image/video together are seen as a higher bar than text-only.
- Users note that the "Omni-Flash" models referenced in the paper are separate, in-house variants optimized for efficiency and dialect support; these appear to back the hosted real-time service rather than the open release.
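
Back-of-the-envelope math behind the quantization claim, using the standard rule of thumb bytes ≈ params × bytes-per-weight (activation and KV-cache overhead ignored; all figures approximate):

```python
# Rough VRAM footprint of a 30B-parameter checkpoint at different precisions.
PARAMS = 30e9

for name, bytes_per_weight in [("BF16", 2.0), ("Q8", 1.0), ("Q4", 0.5)]:
    gb = PARAMS * bytes_per_weight / 1e9
    print(f"{name}: ~{gb:.0f} GB")

# BF16: ~60 GB  (the ~70 GB on disk also covers embeddings and the
#                audio/vision towers)
# Q8:   ~30 GB
# Q4:   ~15 GB  -> fits a 24 GB GPU with headroom for activations/KV cache,
#                  but not a 16 GB unified-memory Mac once the OS takes its share
```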
Local deployment & home automation
- Several people already run Qwen models locally (e.g., on dual 3090s or on laptops) and say they compare favorably with GPT‑4.1 for coding and general tasks.
- One detailed setup: Qwen for reasoning + separate STT/TTS containers, integrated with Home Assistant and ESP32-S3-based “voice satellites” using ESPHome. Use cases include hands-free cooking help, home control, and even security-camera-driven automations.
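
A minimal sketch of that pipeline: speech in → STT container → locally served Qwen (an OpenAI-compatible endpoint such as vLLM or llama.cpp serve) → TTS container → speech out. The STT/TTS endpoint paths, ports, and response shapes below are hypothetical placeholders, not any specific project's API:

```python
# Hypothetical glue code for a voice-satellite turn: STT -> Qwen -> TTS.
import requests

QWEN_URL = "http://localhost:8000/v1/chat/completions"  # OpenAI-compatible server
STT_URL = "http://localhost:9000/transcribe"            # hypothetical STT container
TTS_URL = "http://localhost:9100/synthesize"            # hypothetical TTS container

def voice_turn(wav_bytes: bytes) -> bytes:
    # 1. Speech-to-text in its own container.
    text = requests.post(STT_URL, data=wav_bytes).json()["text"]
    # 2. Reasoning on the locally served Qwen model.
    reply = requests.post(QWEN_URL, json={
        "model": "qwen3-omni-30b-a3b",  # whatever name the server exposes
        "messages": [
            {"role": "system", "content": "You are a hands-free home assistant."},
            {"role": "user", "content": text},
        ],
    }).json()["choices"][0]["message"]["content"]
    # 3. Text-to-speech; the ESP32-S3 satellite plays the returned audio.
    return requests.post(TTS_URL, json={"text": reply}).content
```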
Applications & quality
- Users report strong OCR / information extraction: Qwen cleanly parsed difficult, low-quality invoices that a custom OCR+OpenAI pipeline had struggled with (see the sketch after this list).
- Story generation is described as more natural and humorous than many other models.
- Some slang/Internet culture (e.g., "sussy baka") and mixed-modality control remain weak spots.
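
A sketch of the invoice use case: send a low-quality scan to a vision-capable Qwen endpoint and ask for structured fields, assuming an OpenAI-compatible server. The model name, field list, and endpoint are illustrative, not the commenter's actual pipeline:

```python
# Hypothetical invoice extraction against a local, vision-capable endpoint.
import base64
import json
import requests

def extract_invoice(path: str) -> dict:
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = requests.post("http://localhost:8000/v1/chat/completions", json={
        "model": "qwen3-omni-30b-a3b",  # illustrative model name
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
                {"type": "text",
                 "text": "Extract vendor, date, line items, and total "
                         "as JSON. Return JSON only."},
            ],
        }],
    }).json()
    # Assumes the model honors the JSON-only instruction.
    return json.loads(resp["choices"][0]["message"]["content"])
```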
Geopolitics, openness & market outlook
- A strong sub-thread contrasts China's aggressive open-weights strategy with US labs' closed, "moat"-driven approach.
- Some foresee US attempts to restrict Chinese AI models (e.g., ITAR-like controls), while others doubt effective enforcement, comparing it to piracy.
- Debate over market size for $1–2k private AI appliances: skeptics say most people will stick with cheap cloud subscriptions; others anticipate a sizable niche for privacy-preserving, on-prem “AI toasters,” especially for email and SMB use.
- Multiple commenters stress that open weights constrain monopolistic pricing, shift value to compute, and foster a healthier research and tooling ecosystem.