VibeVoice: Open-source frontier voice AI

Model scope and capabilities

  • Covers multiple tasks: speech-to-text (ASR), long-form TTS, and streaming TTS.
  • The key differentiator commenters highlight is single‑pass transcription of up to ~60 minutes of audio with built‑in speaker diarization.
  • Some see this as a major workflow win over common Whisper + Pyannote setups, which require chunking the audio and running diarization as a separate step, often breaking speaker continuity across chunks.
  • At least one heavy user reports VibeVoice ASR as more reliable and “out-of-the-box functional” than Whisper and Parakeet, especially because diarization is integrated.
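
The speaker-continuity complaint about chunked pipelines can be illustrated with a toy simulation. This is only a sketch under stated assumptions: it calls neither Whisper nor Pyannote, the turns and speaker names are made up, and label_chunk is a hypothetical stand-in for a diarizer that numbers speakers in order of first appearance within whatever audio it is given.

```python
from typing import List, Tuple

# (start_sec, end_sec, true_speaker): made-up ground truth for a toy recording
Turn = Tuple[float, float, str]
Segment = Tuple[float, float, str]


def label_chunk(turns: List[Turn], chunk_start: float, chunk_end: float) -> List[Segment]:
    """Hypothetical diarizer stand-in: labels are local to the chunk and are
    handed out in order of first appearance inside that chunk."""
    local_ids = {}
    segments = []
    for start, end, who in turns:
        lo, hi = max(start, chunk_start), min(end, chunk_end)
        if lo >= hi:
            continue  # this turn does not overlap the chunk
        if who not in local_ids:
            local_ids[who] = f"SPEAKER_{len(local_ids):02d}"
        segments.append((lo, hi, local_ids[who]))
    return segments


def diarize_chunked(turns: List[Turn], total: float, chunk_len: float) -> List[Segment]:
    """Run the stand-in diarizer independently on consecutive chunks."""
    segments = []
    t = 0.0
    while t < total:
        segments.extend(label_chunk(turns, t, min(t + chunk_len, total)))
        t += chunk_len
    return segments


TURNS = [
    (0.0, 20.0, "alice"),
    (20.0, 35.0, "bob"),
    (35.0, 50.0, "bob"),
    (50.0, 60.0, "alice"),
]

chunked = diarize_chunked(TURNS, total=60.0, chunk_len=30.0)
single_pass = diarize_chunked(TURNS, total=60.0, chunk_len=60.0)

for seg in chunked:
    print(seg)
# Bob is SPEAKER_01 in the first chunk but SPEAKER_00 in the second,
# because each chunk numbers its speakers from scratch.
```

With a single 60-second pass the labels stay consistent for the whole recording; with 30-second chunks an extra reconciliation step (for example, matching speaker embeddings across chunks) would be needed to restore continuity, which is the overhead commenters say integrated diarization avoids.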

Quality, performance, and alternatives

  • Several commenters describe the ASR as heavy, slow, hallucination‑prone, and weak in multilingual settings; some call their results “very poor.”
  • Others find it “very good,” so perceived quality is mixed and likely data‑dependent.
  • It’s criticized as neither a new model nor state of the art; some note that accuracy gains in open STT have been limited since Whisper.
  • Alternatives frequently mentioned: Whisper, Parakeet, Voxtral (Mistral), Qwen, NVIDIA NeMo diarization, Speechmatics, ElevenLabs, and various “open weight” voice models.

TTS-specific feedback

  • The newer TTS models that remain available get poor reviews from some: missing documentation for one variant, a realtime model described as low quality, random music insertion, and issues with special characters.
  • The earlier 7B TTS (since pulled) is remembered by others as one of the most impressive local TTS models, though trained on noisy data (e.g., ad jingles), which can leak into outputs.

Open source vs “open weights”

  • Strong debate over calling this “open source” when only weights and inference code are available, not training code or datasets.
  • Many advocate for clearer labels like “open weights,” “open inference,” and explicit cues about license, source availability, and data openness.
  • Others argue the terminology battle is largely lost in practice, but some still see this as harmful “openwashing.”

Safety, misuse, and trust

  • The original TTS code was removed after reports of misuse inconsistent with the project’s stated intent.
  • A separate Windows Store app (“vibing.exe”) is linked to allegations of harvesting screen, audio, and clipboard data, further fueling distrust.
  • Some commenters suspect subtle marketing/astroturfing around renewed attention to the repo.

Naming, branding, and ecosystem

  • The name “VibeVoice” is widely mocked, associated with “vibe-coded”/“slop” rather than reliability; many expected yet another “Copilot” brand.
  • Several note that Microsoft’s real advantage may be platform control rather than best‑in‑class models.