VibeVoice: A Frontier Open-Source Text-to-Speech Model
Perceived Audio Quality
- Many listeners find the demos very impressive and easy to mistake for real speakers at first, especially if one's “guard is down.”
- Others hear strong “uncanny valley” traits: odd intonation, robotic modulation, tone wobbles, and a “low bitrate / Bluetooth mic / mp3-compressed” sound, especially in male voices.
- Several note a metallic or “blocky” timbre, and point out that the speakers never interrupt, stutter, or overlap as humans do and that they pause longer between turns than humans would.
- Some point out mismatched room acoustics between voices (e.g., reverb on male but not female), hurting realism.
Voices, Emotion, and Control
- Female voices are widely judged more convincing and expressive than male ones; some speculate this reflects where effort and investment went.
- Users want finer control of emotion, emphasis, and timing (stress on specific syllables/phonemes) via SSML-like tags or markup; current models mostly modulate loudness/duration.
- Voice cloning is praised as something that “just works,” even capturing emotional tone from samples.
- Singing is almost universally panned as “painfully bad”; some think it should have been omitted.
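The kind of markup users are asking for can be illustrated with standard SSML (the W3C Speech Synthesis Markup Language), which defines elements such as emphasis and break. The sketch below is only an illustration of that request: nothing in the discussion confirms that VibeVoice accepts SSML, and the helper name build_ssml and its parameters are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_ssml(before: str, stressed: str, after: str, pause_ms: int = 300) -> str:
    """Build an SSML string that stresses one word and pauses after it.

    Hypothetical helper for illustration; whether a given TTS engine
    honors these tags is an assumption, not something the source states.
    """
    speak = ET.Element("speak", {"version": "1.0"})
    speak.text = before

    # <emphasis level="strong"> asks the engine to stress the enclosed text.
    emphasis = ET.SubElement(speak, "emphasis", {"level": "strong"})
    emphasis.text = stressed

    # <break time="..."/> requests an explicit pause of the given length.
    pause = ET.SubElement(speak, "break", {"time": f"{pause_ms}ms"})
    pause.tail = after

    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(build_ssml("I ", "never", " said that."))
```

Moving the emphasis element to a different word changes which word is stressed, which is the per-word control commenters say loudness and duration changes alone do not give them.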
Multilingual and Accent Capabilities
- English–Mandarin examples are repeatedly highlighted as a standout: smooth language switching and convincingly “second-language” accents in both directions.
- Some report convincing Finnish output with minimal accent; Chinese output is generally rated good, but some samples have strongly American-accented Mandarin.
- Users wish for genuinely good British (and regional, e.g., Brummie) accents and support for smaller languages like Croatian.
Comparisons to Other TTS Systems
- Compared frequently with ElevenLabs (closed), which many still consider superior overall, especially for voice acting and tools like voice changing and markup.
- Open(-ish) competitors mentioned: Kokoro, Chatterbox, Dia, Orpheus, Higgs Audio, F5/Fish-TTS, CosyVoice, XTTS-2, Sesame, VUI, Unmute, etc., with mixed opinions over which sounds most natural.
- Some feel VibeVoice is SOTA in open models; others think several alternatives or ChatGPT voice sound clearly better.
Performance and Practicality
- On CPU-only or older GPUs, VibeVoice is extremely slow and can develop artifacts when using lower-precision formats, making smaller models like Kokoro more attractive for “GPU-poor” setups.
- This sparks debate about whether heavy, slow “AI TTS” is worth it vs traditional, instant system TTS (e.g., on macOS), especially when “acceptable” quality is enough for accessibility.
- Counterarguments: human-like prosody matters for long-form listening (audiobooks, articles, translation, dubbing, assistive speech), where classic TTS quickly becomes grating.
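One common way to put a number on the "too slow" complaint is the real-time factor (RTF): synthesis time divided by the duration of the audio produced, where values below 1 mean faster than real time. The sketch below assumes a hypothetical synthesize callable that returns the number of audio samples it generated; it does not call any real VibeVoice or Kokoro API.

```python
import time
from typing import Callable


def real_time_factor(elapsed_seconds: float, num_samples: int, sample_rate: int) -> float:
    """Return synthesis time divided by the duration of the generated audio."""
    if num_samples <= 0 or sample_rate <= 0:
        raise ValueError("num_samples and sample_rate must be positive")
    audio_seconds = num_samples / sample_rate
    return elapsed_seconds / audio_seconds


def measure_rtf(synthesize: Callable[[str], int], text: str, sample_rate: int) -> float:
    """Time one call to a (hypothetical) synthesizer and compute its RTF.

    `synthesize` is assumed to return the number of samples it produced.
    """
    start = time.perf_counter()
    num_samples = synthesize(text)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, num_samples, sample_rate)


if __name__ == "__main__":
    def fake_synthesize(text: str) -> int:
        # Stand-in for a real model: pretend to work briefly,
        # then report one second of audio at 24 kHz.
        time.sleep(0.05)
        return 24_000

    print(f"RTF: {measure_rtf(fake_synthesize, 'hello', 24_000):.2f}")
```

Running the same text through two models on the same hardware and comparing RTF gives a concrete basis for the "heavy model versus instant system TTS" trade-off described above.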
Licensing and “Open Source” Concerns
- Model is described as MIT-licensed, which some value for corporate compliance versus “non-commercial” licenses.
- Others argue calling a weights-only release “open source” without training data is misleading and violates the spirit (if not the letter) of open source.
- Later, the public GitHub repo is taken down, then restored with code removed and a note saying it’s a research framework temporarily disabled due to uses “inconsistent with the stated intent” and responsible-AI concerns.
- Commenters question what misuse occurred and what practical purpose the takedown serves when copies and MIT-licensed weights already circulate.
Ecosystem, Tooling, and Miscellaneous Reactions
- People share links to TTS leaderboards and Hugging Face lists to discover top models; some tools (like llm-tts or Kokoro-FastAPI) help compare many models uniformly.
- Questions arise about SSML support, IPA input, and the relationship to other Microsoft voice models; answers remain mostly unclear.
- Some users can’t get the web demo or notebook to match the showcased quality, or they encounter UI glitches.
- The “VibeVoice” name triggers jokes about “vibe coding,” Microsoft naming history, and conflicts with an existing open-source project of the same name.