Voxtral Transcribe 2

Core capabilities & limitations

  • The realtime model (Voxtral Mini 4B Realtime) is open-weight under Apache 2.0 but does not support diarization.
  • Diarization exists only in Voxtral Mini Transcribe V2, which is neither open-weight nor realtime.
  • Realtime model is designed for streaming use (low-latency conversations), batch model for offline file transcription.

Quality, comparisons & benchmarks

  • Many commenters find the realtime demo “off the charts” vs prior open models, including Whisper and Nvidia Parakeet/Nemotron, especially for fluent English and normal speech rates.
  • Others note failures on fast or sloppy speech, music-heavy audio, and some code-switched or accented input.
  • Word Error Rate (WER) claims (~4%) are seen as impressive but potentially misleading: systems that handle punctuation and text normalization differently produce WER figures that are not directly comparable.
  • Some report Parakeet v3 dropping sentences or stuttering; v2 considered more stable. Several still prefer Parakeet for small, on-device setups.
  • There is demand for independent, up-to-date ASR leaderboards; vendor cherry-picking is distrusted.
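The normalization caveat is easy to demonstrate. The toy scorer below (a generic word-level Levenshtein WER, not any vendor's scoring pipeline; the example strings are made up) shows the same hypothesis scoring terribly against a raw reference and perfectly once both sides are lowercased and stripped of punctuation:

```python
import re

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edits needed to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

def normalize(text: str) -> str:
    """Lowercase and strip punctuation before scoring."""
    return re.sub(r"[^\w\s]", "", text.lower())

ref = "Hello, world. How are you?"
hyp = "hello world how are you"
print(wer(ref, hyp))                        # raw: 0.8 — punctuation counts as errors
print(wer(normalize(ref), normalize(hyp)))  # normalized: 0.0 — perfect match
```

Same transcript, WER of 80% or 0% depending purely on text preprocessing, which is why cross-vendor comparisons need a shared normalization step.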

Language coverage & behavior

  • Strong performance reported for English, Spanish, Italian, French, German, Mandarin; struggles on unsupported or low-resource languages.
  • Bengali speech is transcribed as Hindi; Polish and Ukrainian are often mapped to Russian or rendered in mixed scripts; Ukrainian users find this particularly frustrating.
  • Debate over claims that Italian is “phonetically advanced” and whether such language properties explain its low error rates; others cite research suggesting information rates are similar across languages.
  • Discussion on multilingual vs monolingual models:
    • Some want narrower, faster single-language models.
    • Others argue multilingual is necessary for code-switching and loanwords in real life.

Pricing & economics

  • Voxtral's non-realtime pricing ($0.003 per minute of audio) is seen as far cheaper than AWS Transcribe ($0.024/min) and competitive with hosted Whisper offerings (Deepinfra, fal.ai, etc.).
  • Some users calculate 10‑year subscription costs and compare them to “buy once, own forever” software.
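The price gap is concrete when spelled out. A back-of-the-envelope comparison, using the per-minute rates quoted in the thread and a hypothetical workload of 1,000 hours of audio per month:

```python
# Per-minute rates (USD) as quoted in the discussion.
RATES = {
    "Voxtral (batch)": 0.003,
    "AWS Transcribe": 0.024,
}

hours_per_month = 1_000  # hypothetical workload, not from the thread
minutes = hours_per_month * 60

for name, rate in RATES.items():
    print(f"{name}: ${rate * minutes:,.2f}/month")
```

At these rates the workload costs about $180/month on Voxtral versus about $1,440/month on AWS Transcribe, an 8x difference that scales linearly with volume.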

Tooling, deployment & UX

  • Realtime weights are ~7–9 GB; intended for GPU or edge devices via vLLM; too large for today’s in-browser inference.
  • Several want better reference implementations; the current deployment story leans heavily on nightly vLLM builds and remote demos.
  • Users report mixed success with the Hugging Face demo (CSP/adblock/mic issues, some browsers failing).
  • Active interest in: local Linux tools, Android keyboards, desktop apps (Handy, Spokenly, custom scripts) and voice agents using the realtime API.
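The ~7–9 GB weight figure is roughly what parameter count alone predicts. A rough estimate (assuming a 4B-parameter model stored in 16-bit precision, ignoring tokenizer files and other metadata; the helper below is illustrative, not a published formula):

```python
def checkpoint_size_gb(params_billion: float, bytes_per_param: float) -> float:
    """Rough on-disk size of model weights: parameters x bytes per parameter."""
    return params_billion * 1e9 * bytes_per_param / 1e9

print(checkpoint_size_gb(4, 2))  # bf16/fp16: ~8 GB, in line with the reported 7-9 GB
print(checkpoint_size_gb(4, 1))  # int8 quantized: ~4 GB
```

An 8 GB checkpoint also implies roughly that much accelerator memory just to load the weights, before activations and KV cache, which is why commenters rule out in-browser inference for now.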

Other concerns

  • Some are uneasy about giving voice data to cloud models due to cloning/scam risks, though others note mics already leak voice widely.
  • Requests for: realtime translation, diarization in open models, turn detection for voice agents, and domain-specific benchmarks (medical, legal, dev jargon).