Eleven v3
Voice quality vs human performance
- Many commenters find the English voices strikingly realistic, “almost indistinguishable” from real voice actors for short clips.
- A professional voice actor strongly disagrees: says it’s still far from professional work, with missing or forced emotion, flat/predictable delivery, odd timing, and a quality that becomes fatiguing for long-form listening.
- Several note it sounds like polished radio ads rather than natural conversation; tone feels exaggerated in a uniform, “monotonous” way.
- Some see it as great for quick/low-effort content (TikTok, simple narration), but not yet acceptable for audiobooks or high-end acting.
Languages, accents & localization
- Consensus: American English is excellent; many other languages are inconsistent or bad.
- Reports of strong English accents, mid-sentence accent switches, or outright nonsense in: Russian, Romanian, Bulgarian, Italian, Greek, French, Portuguese, Swedish, Norwegian, Japanese, Kazakh, Spanish variants, Tagalog, etc.
- Some languages/voices fare better: Polish is praised, some German and Tamil samples are “okay to good,” but often still sound like an announcer or phone assistant.
- Quality is highly dependent on matching a native-language voice from the voice library; homepage demos are often worse.
- Accent handling (e.g., British, French-accented English) is hit-or-miss and sometimes comical.
- Site UI localization into non-English languages is described as clumsy, literal, and clearly non-native.
Pricing, business model & competition
- Pricing for the v3 API is unclear; the public API is “coming soon.” There’s an 80% discount via the UI until mid‑2025, plus startup grants for high tiers.
- Several complain about subscription + credit “funny money” models and “voice slots,” preferring pure pay‑as‑you‑go.
- Comparisons suggest Eleven is several times more expensive than OpenAI’s TTS at small scale, though may become competitive at very high tiers.
- Many say Eleven remains quality leader, but high prices create space for rivals and open source: Chatterbox, Kokoro, NVIDIA NeMo + XTTS, PlayHT, Hume, Mirage, etc.
Features, quirks & API
- v3 includes inline expressive tags (e.g., [laughs]), but laughter often sounds like a separately inserted segment rather than being integrated into the surrounding words.
- Some users observe limited but surprising singing behavior triggered by song lyrics or [verse]/[chorus] tags; quality is roughly “like a human who can’t sing.”
- Reports of number misreads, language-accent glitches, and breaking changes to existing voices when moving from v2 to v3.
- Echo issues reported in voice agents are attributed by other commenters to missing client-side echo cancellation rather than to the model itself.
- v3 is currently a research preview and not fully available via API yet.
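Since the v3 API is still “coming soon,” any request shape is speculative. The sketch below builds a payload following ElevenLabs’ existing v1 REST endpoint (POST /v1/text-to-speech/{voice_id} with a JSON body of `text` and `model_id`); the `eleven_v3` model id, the placeholder voice id, and the exact audio-tag vocabulary are assumptions, shown only to illustrate how inline tags like [laughs] would ride along inside the text field:

```python
import json

# Base URL of ElevenLabs' public v1 REST API.
API_BASE = "https://api.elevenlabs.io/v1"

def build_tts_request(voice_id: str, text: str, model_id: str = "eleven_v3"):
    """Return (url, json_body) for a hypothetical v3 TTS call.

    The caller is expected to send this with an `xi-api-key` header;
    "eleven_v3" is an assumed model id, since v3 is a research preview
    and not fully available via API at the time of writing.
    """
    url = f"{API_BASE}/text-to-speech/{voice_id}"
    payload = {
        "text": text,          # inline audio tags such as [laughs] go here
        "model_id": model_id,  # assumption: v3 selected via model_id
    }
    return url, json.dumps(payload)

# Example: a laugh tag embedded mid-sentence. Commenters note the
# resulting laugh often sounds spliced in rather than blended into
# the neighboring words.
url, body = build_tts_request(
    "example-voice-id",  # placeholder, not a real voice id
    "Well, that went better than expected. [laughs] Let's try again.",
)
```

The same pattern would apply to the [verse]/[chorus] tags some users report triggering limited singing behavior.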
User experience, ethics & aesthetics
- Strong unease about replacing human voice actors and narrators; some call it anti-human and depressing, especially when real voices are cloned.
- Audiobook users value human narrators as scarce curators; fear platforms will cut costs with AI and degrade the experience.
- Several dislike the “patronizing,” emotionally validating style in support scripts, expecting it to age into an obvious negative trope.
- Others simply find the demos insincere and would rather have minimal, task-focused machine voices.