VibeVoice: A Frontier Open-Source Text-to-Speech Model

Perceived Audio Quality

  • Many listeners find the demos very impressive and initially easy to mistake for real speakers, especially when their “guard is down.”
  • Others hear strong “uncanny valley” traits: odd intonation, robotic modulation, tone wobbles, and a “low bitrate / Bluetooth mic / mp3-compressed” sound, especially in male voices.
  • Several note a metallic or “blocky” timbre, and observe that speakers never interrupt, stutter, or overlap as humans do, leaving longer-than-human pauses between turns.
  • Some point out mismatched room acoustics between voices (e.g., reverb on male but not female), hurting realism.

Voices, Emotion, and Control

  • Female voices are widely judged more convincing and expressive than male ones; some speculate this reflects where effort and investment went.
  • Users want finer control of emotion, emphasis, and timing (stress on specific syllables/phonemes) via SSML-like tags or markup; current models mostly modulate loudness/duration.
  • Voice cloning is praised as something that “just works,” even capturing emotional tone from samples.
  • Singing is almost universally panned as “painfully bad”; some think it should have been omitted.
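
To make the request for SSML-like control concrete, here is a minimal sketch of the kind of markup commenters have in mind. It builds a small SSML-style fragment using the `emphasis` and `break` elements from the W3C SSML specification. Nothing in the thread confirms that VibeVoice accepts SSML, so this is purely an illustration; the helper name `build_ssml` and its parameters are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_ssml(before: str, stressed: str, after: str, pause_ms: int = 300) -> str:
    """Return an SSML-style fragment that stresses one word and then pauses.

    Hypothetical helper for illustration only; it is not part of VibeVoice
    or of any particular TTS library.
    """
    speak = ET.Element("speak")
    speak.text = before

    # <emphasis level="strong"> marks the word the speaker should stress.
    emphasis = ET.SubElement(speak, "emphasis", {"level": "strong"})
    emphasis.text = stressed
    emphasis.tail = " "

    # <break time="..."> requests an explicit pause of the given length.
    pause = ET.SubElement(speak, "break", {"time": f"{pause_ms}ms"})
    pause.tail = after

    return ET.tostring(speak, encoding="unicode")


ssml = build_ssml("I ", "never", "said that.")
print(ssml)
```

An engine that honoured such tags would let an author place stress and pauses explicitly rather than relying on the model's own prosody.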

Multilingual and Accent Capabilities

  • English–Mandarin examples are repeatedly highlighted as a standout: smooth language switching and convincingly “second-language” accents in both directions.
  • Some report convincing Finnish output with minimal accent; Chinese output is generally rated good, but some samples have strongly American-accented Mandarin.
  • Users wish for genuinely good British (and regional, e.g., Brummie) accents and support for smaller languages like Croatian.

Comparisons to Other TTS Systems

  • Compared frequently with ElevenLabs (closed), which many still consider superior overall, especially for voice acting and tools like voice changing and markup.
  • Open(-ish) competitors mentioned: Kokoro, Chatterbox, Dia, Orpheus, Higgs Audio, F5/Fish-TTS, CosyVoice, XTTS-2, Sesame, VUI, Unmute, etc., with mixed opinions on which sounds most natural.
  • Some feel VibeVoice is SOTA in open models; others think several alternatives or ChatGPT voice sound clearly better.

Performance and Practicality

  • On CPU-only or older GPUs, VibeVoice is extremely slow and can develop artifacts when using lower-precision formats, making smaller models like Kokoro more attractive for “GPU-poor” setups.
  • This sparks debate about whether heavy, slow “AI TTS” is worth it compared with traditional, instant system TTS (e.g., on macOS), especially when “acceptable” quality is enough for accessibility.
  • Counterarguments: human-like prosody matters for long-form listening (audiobooks, articles, translation, dubbing, assistive speech), where classic TTS quickly becomes grating.

Licensing and “Open Source” Concerns

  • The model is described as MIT-licensed, which some value for corporate compliance compared with “non-commercial” licenses.
  • Others argue calling a weights-only release “open source” without training data is misleading and violates the spirit (if not the letter) of open source.
  • Later, the public GitHub repo is taken down, then restored with code removed and a note saying it’s a research framework temporarily disabled due to uses “inconsistent with the stated intent” and responsible-AI concerns.
  • Commenters question what misuse occurred and what practical purpose the takedown serves when copies and MIT-licensed weights already circulate.

Ecosystem, Tooling, and Miscellaneous Reactions

  • People share links to TTS leaderboards and Hugging Face lists to discover top models; some tools (like llm-tts or Kokoro-FastAPI) help compare many models uniformly.
  • Questions arise about SSML support, IPA input, and the relationship to other Microsoft voice models; answers remain mostly unclear.
  • Some users can’t get the web demo or notebook to match the showcased quality, or they encounter UI glitches.
  • The “VibeVoice” name triggers jokes about “vibe coding,” Microsoft naming history, and conflicts with an existing open-source project of the same name.