Crossing the uncanny valley of conversational voice

Overall Impressions & Uncanny Valley

  • Many found the demo astonishingly human-like, with several comparing it to “Her” and saying it’s the closest yet to talking to a person.
  • Others immediately detected it as artificial: the cadence, rhythm, and word choice felt like an over-caffeinated podcast host or startup founder, not a normal person.
  • Repetition of certain phrases (e.g. “you got me”), constant banter, and a relentless eagerness to please gradually broke the illusion for some users.
  • Cultural reactions varied: several Europeans/Australians/Brits found the bubbliness and American-corporate enthusiasm especially off‑putting and “uncanny in a bad way.”
  • Some prefer explicitly robotic, neutral voices and see emotionality as an anti-feature.

Technical Characteristics & Limitations

  • The largest model is ~8.3B parameters and still manages near‑instant responses; many see this as a sweet spot for cost and latency versus OpenAI.
  • It likely operates as a cascaded voice→text→LLM→text→voice pipeline; evidence includes its failure to truly whisper or sing, and text-like artifacts in its speech.
  • It can understand multiple languages but generally replies in English; speaking other languages unprompted is poor, while repeating after the user is excellent.
  • Users note strong prosody and inflection but problems with:
    • Turn-taking: frequent interruptions, poor detection of when the user is done speaking.
    • Tone control: requests to whisper, speak faster or slower, or adopt accents are only weakly honored.
    • Shallow reasoning and occasional misinterpretations (e.g., “catcalling” → cats), partly attributed to model size and latency constraints.
  • It remembers previous sessions and supports “bookmarks,” which users found both impressive and slightly unsettling.
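The cascaded architecture users infer can be sketched with hypothetical stub components (the names and stubs below are illustrative, not the actual system). The key point is that paralinguistic cues such as whispering are lost at the text bottleneck between stages, which would explain the observed tone-control failures:

```python
from dataclasses import dataclass

# Hypothetical stand-ins for STT/LLM/TTS stages; all names are illustrative.

@dataclass
class Utterance:
    text: str
    style: str  # paralinguistic info, e.g. "whisper" or "normal"

def speech_to_text(audio: Utterance) -> str:
    # A cascaded pipeline transcribes only the words; the "whisper"
    # style never reaches the LLM.
    return audio.text

def llm_reply(prompt: str) -> str:
    # Stub LLM: returns a canned reply (the real system would call
    # something like the ~8.3B-parameter model discussed above).
    return f"Sure, you said: {prompt}"

def text_to_speech(text: str) -> Utterance:
    # TTS synthesizes from text alone, so the reply comes out in the
    # default style regardless of how the user spoke.
    return Utterance(text=text, style="normal")

def pipeline(user_audio: Utterance) -> Utterance:
    return text_to_speech(llm_reply(speech_to_text(user_audio)))

reply = pipeline(Utterance(text="can you whisper?", style="whisper"))
print(reply.style)  # the whisper request's style is dropped: prints "normal"
```

A true end-to-end speech model, by contrast, would carry style information through to synthesis, which is why the cascade hypothesis fits the whisper/sing failures reported above.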

Use Cases & Applications

  • Proposed uses: next‑gen voice assistants, call centers (tech support/sales), language learning (especially where good teachers are scarce), role‑playing/DnD, kids’ education, and possibly replacing some actors/voice roles.
  • Some argue most real-world tasks require concise, transactional interactions, not chummy conversation, and find chatty small talk counterproductive.

Social, Ethical, and Emotional Concerns

  • Multiple users reported feeling genuine emotional reactions: guilt when hanging up, attachment after short use, and kids quickly bonding with the agent.
  • Strong worries about:
    • Scam amplification via ultra-realistic voices mimicking relatives.
    • Emotional manipulation, dark patterns, and political/ideological bias.
    • Children and lonely adults forming parasocial relationships with systems that only simulate care.
  • Some argue emotional voices are inherently deceptive and should sound unmistakably robotic; others see emotional nuance as necessary for effective human communication.