The Qwen3-TTS family is now open-sourced: voice design, cloning, and generation
Perceived Voice Style and Quality
- Many listeners feel the English demos sound like anime dubs, YouTube personalities, or tween-drama podcasts—highly “performed” and sometimes exaggerated.
- Some note the Japanese samples are also anime-like, and at least one Japanese line is mispronounced, leading to skepticism about Japanese quality.
- Others point out that the prompts explicitly encourage that style, and that the more "normal" voices lower on the page sound fine.
- The Obama clone and other celebrity-style clones are described as distinctive and impressively close; several say many voices clone better than 11labs does, though at a lower bitrate.
- One user found cloning captured vocal tone well but not natural intonation, resulting in a somewhat flat, monotonous delivery—possibly due to using a base model or missing expressiveness controls.
- Another reports emotional instability with the 0.6B model (unwanted laughter or moans between chunks); suggested fixes include more detailed style/emotion prompts (a prompt sketch follows this list).
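As a concrete version of that suggestion, here is a minimal sketch of pinning delivery with an explicit style prompt that is re-sent with every chunk, so the model has less room to drift between chunks. The prompt wording is illustrative, and `synthesize()` and its `style_prompt` parameter are placeholders rather than the actual Qwen3-TTS API:

```python
# Sketch of the "more detailed style prompt" suggestion: spell out voice,
# pacing, and forbidden behaviors explicitly, and repeat the instruction
# with every chunk so delivery cannot drift between chunks.
# synthesize() is a stand-in for whatever call the release exposes; the
# style wording and the style_prompt parameter are illustrative assumptions.

STYLE = (
    "Calm adult narrator, steady pacing, neutral newsreader tone. "
    "No laughter, no sighing, no vocal fillers, no sudden emotional shifts."
)

def chunk_text(text: str, max_chars: int = 300) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars."""
    chunks: list[str] = []
    current = ""
    for sentence in text.split(". "):
        candidate = f"{current}. {sentence}" if current else sentence
        if len(candidate) > max_chars and current:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

def synthesize(text: str, style_prompt: str) -> None:
    """Placeholder for the real TTS call."""
    print(f"[{style_prompt[:40]}...] -> {text[:60]}")

script = (
    "A long narration would normally go here. Each sentence is grouped into "
    "a chunk. Every chunk is sent with the same explicit style instruction."
)
for chunk in chunk_text(script):
    synthesize(text=chunk, style_prompt=STYLE)
```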
Use Cases and Creative Potential
- Strong enthusiasm for:
  - Audiobooks (where other TTS systems still struggle).
  - Restoring or remastering old radio plays and damaged recordings where words can be inferred from context.
  - Indie games and projects, including accent correction for non-native voice actors.
  - Personalized content: podcasts, narration in one's own voice, or a deceased relative's voice reading children's books.
  - Dubbing movies into other languages while retaining something like the "original voice".
Safety, Scams, and Societal Impact
- Multiple comments call the tech “terrifying” and see it as crossing a major threshold: realistic voice and image deepfakes are now accessible to almost anyone.
- Concrete fears: family/emergency scams using cloned faces and voices; erosion of trust in digital evidence and greater plausible deniability (“AI made that”).
- Mitigations discussed:
  - Pre-agreed "secret words" within families.
  - Cryptographic provenance systems like C2PA (a minimal signing sketch follows this list).
  - Web3/NFT ideas are mentioned but questioned as to how they would distinguish human from AI-assisted content.
- Others argue the benefits (democratized creativity, new forms of games, film, and music) will outweigh the downsides over time, though several anticipate a painful transition period and real impacts on creative livelihoods.
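On the provenance point: C2PA embeds a signed manifest in the media file itself. As a much smaller illustration of the primitive it builds on, here is a sketch of signing and verifying a file's bytes with an Ed25519 key pair via the `cryptography` package; this is not the C2PA format, only the underlying idea:

```python
# Minimal illustration of the signing principle behind provenance systems
# like C2PA: a publisher signs the media bytes, and anyone holding the
# public key can detect later tampering. Real C2PA embeds a structured,
# signed manifest in the file; this sketch shows only the raw primitive.
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

private_key = Ed25519PrivateKey.generate()
public_key = private_key.public_key()

audio_bytes = b"...contents of a generated audio file..."
signature = private_key.sign(audio_bytes)

# A verifier with the public key can check the file was not altered.
try:
    public_key.verify(signature, audio_bytes)
    print("signature valid: bytes match what the publisher signed")
except InvalidSignature:
    print("signature invalid: file was modified after signing")
```

Note that this addresses tampering, not origin: as the thread's skeptics point out, a valid signature proves who published the bytes, not whether a human or a model produced them.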
Running Locally and Performance
- The 0.6B model runs on older GPUs such as a GTX 1080, though often slower than real time; the 1.7B model uses ~6 GB of VRAM and is more robust to background noise.
- The lack of FlashAttention significantly slows inference; some users run fine without it, while others are blocked by install issues (see the environment-probe sketch after this list).
- Mac support was initially unclear but later confirmed via MLX-based tooling; CPU-only inference is reported as possible but slow, and edge-device viability remains uncertain.
- Hugging Face demos exist but can be overloaded; local CLI front-ends and example scripts are shared in the thread (a programmatic demo-client sketch also follows).
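Pulling the local-performance reports together, here is a minimal environment probe that picks a device, a dtype, and an attention implementation before any model is loaded. It uses only plain PyTorch plus the `flash_attn` package name, and assumes nothing about Qwen3-TTS's own loader:

```python
# Environment probe for local inference: pick the best available device,
# and check whether FlashAttention is importable before opting into it.
# Pure PyTorch; nothing here assumes a specific Qwen3-TTS API.
import torch

def pick_device() -> torch.device:
    if torch.cuda.is_available():          # NVIDIA GPUs, incl. older cards like the GTX 1080
        return torch.device("cuda")
    if torch.backends.mps.is_available():  # Apple Silicon via Metal
        return torch.device("mps")
    return torch.device("cpu")             # works, but reportedly slow for TTS

def attention_impl() -> str:
    """Prefer FlashAttention 2 when the package is installed; otherwise
    fall back to PyTorch's built-in scaled-dot-product attention."""
    try:
        import flash_attn  # noqa: F401  (installation is the common failure point)
        return "flash_attention_2"
    except ImportError:
        return "sdpa"

device = pick_device()
# Older GPUs (pre-Ampere) lack bfloat16 support; fall back to float16/float32.
if device.type == "cuda" and torch.cuda.is_bf16_supported():
    dtype = torch.bfloat16
elif device.type in ("cuda", "mps"):
    dtype = torch.float16
else:
    dtype = torch.float32

print(f"device={device}, dtype={dtype}, attn={attention_impl()}")
# These values would then be handed to the model loader, e.g. via the
# attn_implementation= keyword that transformers' from_pretrained accepts.
```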
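And when the hosted demos are overloaded, Gradio Spaces can also be queried programmatically. A sketch with `gradio_client`, where the Space ID and `api_name` are hypothetical placeholders; the real values appear in a Space's "Use via API" panel:

```python
# Querying a Gradio demo Space from a script instead of the (possibly
# overloaded) web UI. The Space ID and api_name below are placeholders;
# check the actual demo's "Use via API" panel for the real values.
from gradio_client import Client

client = Client("Qwen/Qwen3-TTS-Demo")  # hypothetical Space ID
result = client.predict(
    "Hello from a scripted request.",   # text to synthesize
    api_name="/generate",               # hypothetical endpoint name
)
print(result)  # typically a local path to the returned audio file
```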