Voxtral Transcribe 2
Core capabilities & limitations
- The realtime model (Voxtral Mini 4B Realtime) is open-weight and Apache 2.0 licensed, but does not support diarization.
- Diarization exists only in Voxtral Mini Transcribe V2, which is not open-weight and not realtime.
- The realtime model is designed for streaming use (low-latency conversations); the batch model is for offline file transcription.
Quality, comparisons & benchmarks
- Many commenters find the realtime demo “off the charts” vs prior open models, including Whisper and Nvidia Parakeet/Nemotron, especially for fluent English and normal speech rates.
- Others note failures on fast or sloppy speech, music-heavy audio, and some code-switched or accented input.
- Word Error Rate (WER) claims (~4%) are seen as impressive but potentially misleading: systems handle punctuation and text normalization differently, which makes direct WER comparisons tricky.
- Some report Parakeet v3 dropping sentences or stuttering; v2 is considered more stable. Several still prefer Parakeet for small, on-device setups.
- There is demand for independent, up-to-date ASR leaderboards; vendor cherry-picking is distrusted.
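The point about normalization can be made concrete with a minimal sketch. It assumes WER is computed as word-level edit distance divided by the number of reference words; the `normalize` helper (lowercasing and stripping punctuation) and the sample sentences are hypothetical, not any vendor's actual scoring pipeline.

```python
import re


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Single-row dynamic programming over the Levenshtein table.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


def normalize(text: str) -> str:
    """Hypothetical normalizer: lowercase and drop punctuation."""
    return re.sub(r"[^\w\s]", "", text.lower())


reference = "Hello, world. It is 5 o'clock."
hypothesis = "hello world it is 5 o'clock"

raw_wer = wer(reference, hypothesis)
normalized_wer = wer(normalize(reference), normalize(hypothesis))
print(f"raw WER: {raw_wer:.2f}, normalized WER: {normalized_wer:.2f}")
```

The same transcript scores very differently depending on whether casing and punctuation count as errors, which is why headline WER figures from different vendors are hard to compare without knowing the normalization used.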
Language coverage & behavior
- Strong performance is reported for English, Spanish, Italian, French, German, and Mandarin; the model struggles on unsupported or low-resource languages.
- Reported failures: Bengali speech is transcribed as Hindi, and Polish or Ukrainian is often mapped to Russian or rendered in mixed scripts; Ukrainian users find this particularly frustrating.
- Debate over “phonetically advanced” Italian and whether language properties explain its low error rates; others cite research suggesting similar information rates across languages.
- Discussion on multilingual vs monolingual models:
  - Some want narrower, faster single-language models.
  - Others argue multilingual is necessary for code-switching and loanwords in real life.
Pricing & economics
- Voxtral non-realtime pricing ($0.003/min audio) is seen as much cheaper than AWS Transcribe ($0.024/min) and competitive with Whisper hosting (Deepinfra, fal.ai, etc.).
- Some users calculate 10‑year subscription costs and compare to “buy once, own forever” software.
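The price comparison above works out as follows. This is a rough sketch using only the per-minute rates quoted in the discussion; the usage level of 60 minutes of audio per day is a hypothetical assumption chosen for illustration, and real bills depend on tiers, rounding, and price changes.

```python
# Per-minute rates as quoted in the discussion (USD per minute of audio).
VOXTRAL_BATCH_USD_PER_MIN = 0.003
AWS_TRANSCRIBE_USD_PER_MIN = 0.024

# Hypothetical usage assumption: one hour of audio per day for ten years.
HYPOTHETICAL_MINUTES_PER_DAY = 60
ten_year_minutes = HYPOTHETICAL_MINUTES_PER_DAY * 365 * 10


def transcription_cost(minutes: float, usd_per_min: float) -> float:
    """Flat per-minute pricing with no tiers or minimums."""
    return minutes * usd_per_min


for name, rate in (
    ("Voxtral (batch)", VOXTRAL_BATCH_USD_PER_MIN),
    ("AWS Transcribe", AWS_TRANSCRIBE_USD_PER_MIN),
):
    per_hour = transcription_cost(60, rate)
    ten_years = transcription_cost(ten_year_minutes, rate)
    print(f"{name}: ${per_hour:.2f}/hour, ${ten_years:,.2f} over 10 years")
```

At the quoted rates the gap is a factor of eight regardless of volume; the ten-year total is the kind of figure commenters set against one-time-purchase software.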
Tooling, deployment & UX
- Realtime weights are ~7–9 GB; intended for GPU or edge devices via vLLM; too large for today’s in-browser inference.
- Several want better reference implementations; current story leans heavily on vLLM nightly and remote demos.
- Users report mixed success with the Hugging Face demo (CSP/adblock/mic issues, some browsers failing).
- Active interest in: local Linux tools, Android keyboards, desktop apps (Handy, Spokenly, custom scripts) and voice agents using the realtime API.
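A back-of-the-envelope check on the ~7–9 GB figure: the sketch below assumes roughly 4 billion parameters (inferred only from the "4B" in the model name) and estimates raw weight size at different precisions. The 16-bit storage format and the quantized rows are assumptions for illustration, not a statement about what formats are actually published.

```python
def weight_size_gb(num_params: float, bytes_per_param: float) -> float:
    """Raw weight size in gigabytes (1 GB = 1e9 bytes), ignoring file overhead."""
    return num_params * bytes_per_param / 1e9


# Assumption: about 4e9 parameters, taken from the "4B" in the model name.
ASSUMED_PARAMS = 4e9

for label, bytes_per_param in (
    ("16-bit (bf16/fp16)", 2.0),
    ("8-bit (hypothetical quantization)", 1.0),
    ("4-bit (hypothetical quantization)", 0.5),
):
    size = weight_size_gb(ASSUMED_PARAMS, bytes_per_param)
    print(f"{label}: ~{size:.1f} GB")
```

Under these assumptions, 16-bit weights land in the reported range, which is consistent with the view that the model is too large for in-browser inference today.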
Other concerns
- Some are uneasy about giving voice data to cloud models due to cloning/scam risks, though others note mics already leak voice widely.
- Requests for: realtime translation, diarization in open models, turn detection for voice agents, and domain-specific benchmarks (medical, legal, dev jargon).