Crossing the uncanny valley of conversational voice
Overall Impressions & Uncanny Valley
- Many found the demo astonishingly human-like, with several comparing it to “Her” and saying it’s the closest yet to talking to a person.
- Others detected it as fake immediately: the cadence, rhythm, and word choice felt like an over-caffeinated podcast host or startup founder rather than a normal person.
- Repetition of certain phrases (“you got me”), constant banter, and a relentless eagerness to please gradually broke the illusion for some users.
- Cultural reactions varied: several Europeans/Australians/Brits found the bubbliness and American-corporate enthusiasm especially off‑putting and “uncanny in a bad way.”
- Some prefer explicitly robotic, neutral voices and see emotionality as an anti-feature.
Technical Characteristics & Limitations
- The largest model is ~8.3B parameters and still manages near‑instant responses; many see this as a sweet spot for cost and latency versus OpenAI.
- Likely operates as a cascaded voice→text→LLM→text→voice pipeline; evidence includes its inability to truly whisper or sing, and text-like artifacts in its speech.
- It can understand multiple languages but generally replies in English; its speech in other languages is poor, though repeating phrases after the user is excellent.
- Users note strong prosody and inflection but problems with:
  - Turn-taking: frequent interruptions and poor detection of when the user has finished speaking.
  - Tone control: requests to whisper, speak faster or slower, or adopt accents are only weakly honored.
  - Shallow reasoning and occasional misinterpretations (e.g., hearing “catcalling” and talking about cats), partly attributed to model size and latency constraints.
- It remembers previous sessions and supports “bookmarks,” which users found both impressive and slightly unsettling.
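The cascaded pipeline users inferred above can be sketched as follows. This is a minimal illustration of the voice→text→LLM→text→voice hypothesis, not Sesame’s actual implementation; every function here is a hypothetical stub, and the real system’s components and APIs are unknown.

```python
# Hypothetical sketch of the inferred cascaded pipeline:
# audio -> ASR -> LLM -> TTS -> audio. All stubs are illustrative.

def transcribe(audio: bytes) -> str:
    """ASR stub: a real system would run a speech-to-text model here.
    Acoustic detail (whispering, singing) is lost at this step."""
    return "what's the weather like?"

def generate_reply(transcript: str, history: list[str]) -> str:
    """LLM stub: a small (~8B) model keeps latency low, plausibly at
    the cost of the shallow reasoning users observed."""
    return f"Good question! You asked: {transcript}"

def synthesize(text: str) -> bytes:
    """TTS stub: prosody is applied only here, which would explain why
    'whisper' or 'sing' requests are weakly honored -- the LLM in the
    middle never sees or emits acoustics, just text."""
    return text.encode()

def voice_turn(audio: bytes, history: list[str]) -> bytes:
    """One conversational turn through the cascade, keeping a text
    history (which would also enable cross-session memory)."""
    transcript = transcribe(audio)
    history.append(transcript)
    reply = generate_reply(transcript, history)
    history.append(reply)
    return synthesize(reply)
```

Under this hypothesis, turn-taking problems would also make sense: the pipeline must decide when the user’s audio segment ends before `transcribe` can run, so end-of-utterance detection becomes a separate, error-prone component.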
Use Cases & Applications
- Proposed uses: next‑gen voice assistants, call centers (tech support/sales), language learning (especially where good teachers are scarce), role‑playing games/D&D, kids’ education, and possibly replacing some actors/voice roles.
- Some argue most real-world tasks require concise, transactional interactions, not chummy conversation, and find chatty small talk counterproductive.
Social, Ethical, and Emotional Concerns
- Multiple users reported feeling genuine emotional reactions: guilt when hanging up, attachment after short use, and kids quickly bonding with the agent.
- Strong worries about:
  - Scam amplification via ultra-realistic voices mimicking relatives.
  - Emotional manipulation, dark patterns, and political/ideological bias.
  - Children and lonely adults forming parasocial relationships with systems that only simulate care.
- Some argue emotional voices are inherently deceptive and should sound unmistakably robotic; others see emotional nuance as necessary for effective human communication.