Voxtral Transcribe 2

Core capabilities & limitations

  • The realtime model (Voxtral Mini 4B Realtime) is open-weight under Apache 2.0 but does not support diarization.
  • Diarization exists only in Voxtral Mini Transcribe V2, which is neither open-weight nor realtime.
  • Realtime model is designed for streaming use (low-latency conversations), batch model for offline file transcription.

Quality, comparisons & benchmarks

  • Many commenters find the realtime demo “off the charts” vs prior open models, including Whisper and Nvidia Parakeet/Nemotron, especially for fluent English and normal speech rates.
  • Others note failures on fast or sloppy speech, music-heavy audio, and some code-switched or accented input.
  • Word Error Rate (WER) claims (~4%) are seen as impressive but potentially misleading: systems that handle punctuation and text normalization differently produce WER figures that are not directly comparable.
  • Some report Parakeet v3 dropping sentences or stuttering; v2 considered more stable. Several still prefer Parakeet for small, on-device setups.
  • There is demand for independent, up-to-date ASR leaderboards; vendor cherry-picking is distrusted.
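The normalization caveat is easy to demonstrate. The toy scorer below (a generic word-level Levenshtein WER, not any vendor's scoring pipeline; the example strings are made up) shows the same hypothesis scoring terribly against a raw reference and perfectly once both sides are lowercased and stripped of punctuation:

```python
import re

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edits needed to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

def normalize(text: str) -> str:
    """Lowercase and strip punctuation before scoring."""
    return re.sub(r"[^\w\s]", "", text.lower())

ref = "Hello, world. How are you?"
hyp = "hello world how are you"
print(wer(ref, hyp))                        # raw: 0.8 — punctuation counts as errors
print(wer(normalize(ref), normalize(hyp)))  # normalized: 0.0 — perfect match
```

Same transcript, WER of 80% or 0% depending purely on text preprocessing, which is why cross-vendor comparisons need a shared normalization step.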

Language coverage & behavior

  • Strong performance reported for English, Spanish, Italian, French, German, Mandarin; struggles on unsupported or low-resource languages.
  • Bengali speech is transcribed as Hindi; Polish and Ukrainian are often mapped to Russian or rendered in mixed scripts; Ukrainian users find this particularly frustrating.
  • Debate over claims that Italian is “phonetically advanced” and whether such language properties explain its low error rates; others cite research suggesting information rates are similar across languages.
  • Discussion on multilingual vs monolingual models:
    • Some want narrower, faster single-language models.
    • Others argue multilingual is necessary for code-switching and loanwords in real life.

Pricing & economics

  • Voxtral's non-realtime pricing ($0.003 per minute of audio) is seen as far cheaper than AWS Transcribe ($0.024/min) and competitive with hosted Whisper offerings (Deepinfra, fal.ai, etc.).
  • Some users calculate 10‑year subscription costs and compare them to “buy once, own forever” software.
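The price gap is concrete when spelled out. A back-of-the-envelope comparison, using the per-minute rates quoted in the thread and a hypothetical workload of 1,000 hours of audio per month:

```python
# Per-minute rates (USD) as quoted in the discussion.
RATES = {
    "Voxtral (batch)": 0.003,
    "AWS Transcribe": 0.024,
}

hours_per_month = 1_000  # hypothetical workload, not from the thread
minutes = hours_per_month * 60

for name, rate in RATES.items():
    print(f"{name}: ${rate * minutes:,.2f}/month")
```

At these rates the workload costs about $180/month on Voxtral versus about $1,440/month on AWS Transcribe, an 8x difference that scales linearly with volume.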

Tooling, deployment & UX

  • Realtime weights are ~7–9 GB; intended for GPU or edge devices via vLLM; too large for today’s in-browser inference.
  • Several want better reference implementations; the current deployment story leans heavily on nightly vLLM builds and remote demos.
  • Users report mixed success with the Hugging Face demo (CSP/adblock/mic issues, some browsers failing).
  • Active interest in: local Linux tools, Android keyboards, desktop apps (Handy, Spokenly, custom scripts) and voice agents using the realtime API.
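The ~7–9 GB weight figure is roughly what parameter count alone predicts. A rough estimate (assuming a 4B-parameter model stored in 16-bit precision, ignoring tokenizer files and other metadata; the helper below is illustrative, not a published formula):

```python
def checkpoint_size_gb(params_billion: float, bytes_per_param: float) -> float:
    """Rough on-disk size of model weights: parameters x bytes per parameter."""
    return params_billion * 1e9 * bytes_per_param / 1e9

print(checkpoint_size_gb(4, 2))  # bf16/fp16: ~8 GB, in line with the reported 7-9 GB
print(checkpoint_size_gb(4, 1))  # int8 quantized: ~4 GB
```

An 8 GB checkpoint also implies roughly that much accelerator memory just to load the weights, before activations and KV cache, which is why commenters rule out in-browser inference for now.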

Other concerns

  • Some are uneasy about giving voice data to cloud models due to cloning/scam risks, though others note mics already leak voice widely.
  • Requests for: realtime translation, diarization in open models, turn detection for voice agents, and domain-specific benchmarks (medical, legal, dev jargon).