Crossing the uncanny valley of conversational voice

Overall Impressions & Uncanny Valley

  • Many found the demo astonishingly human-like, with several comparing it to “Her” and saying it’s the closest yet to talking to a person.
  • Others immediately detected it as artificial: the cadence, rhythm, and word choice felt like an over-caffeinated podcast host or startup founder, not a normal person.
  • Repetition of certain phrases (e.g. “you got me”), constant banter, and a relentless eagerness to please gradually broke the illusion for some users.
  • Cultural reactions varied: several Europeans/Australians/Brits found the bubbliness and American-corporate enthusiasm especially off‑putting and “uncanny in a bad way.”
  • Some prefer explicitly robotic, neutral voices and see emotionality as an anti-feature.

Technical Characteristics & Limitations

  • The largest model is ~8.3B parameters and still manages near‑instant responses; many see this as a sweet spot for cost and latency versus OpenAI.
  • It likely operates as a cascaded voice→text→LLM→text→voice pipeline; evidence includes its failure to truly whisper or sing, and text-like artifacts in its speech.
  • It can understand multiple languages but generally replies in English; speaking other languages unprompted is poor, while repeating after the user is excellent.
  • Users note strong prosody and inflection but problems with:
    • Turn-taking: frequent interruptions, poor detection of when the user is done speaking.
    • Tone control: requests to whisper, speak faster or slower, or adopt accents are only weakly honored.
    • Shallow reasoning and occasional misinterpretations (e.g., “catcalling” → cats), partly attributed to model size and latency constraints.
  • It remembers previous sessions and supports “bookmarks,” which users found both impressive and slightly unsettling.
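The cascaded architecture users infer can be sketched with hypothetical stub components (the names and stubs below are illustrative, not the actual system). The key point is that paralinguistic cues such as whispering are lost at the text bottleneck between stages, which would explain the observed tone-control failures:

```python
from dataclasses import dataclass

# Hypothetical stand-ins for STT/LLM/TTS stages; all names are illustrative.

@dataclass
class Utterance:
    text: str
    style: str  # paralinguistic info, e.g. "whisper" or "normal"

def speech_to_text(audio: Utterance) -> str:
    # A cascaded pipeline transcribes only the words; the "whisper"
    # style never reaches the LLM.
    return audio.text

def llm_reply(prompt: str) -> str:
    # Stub LLM: returns a canned reply (the real system would call
    # something like the ~8.3B-parameter model discussed above).
    return f"Sure, you said: {prompt}"

def text_to_speech(text: str) -> Utterance:
    # TTS synthesizes from text alone, so the reply comes out in the
    # default style regardless of how the user spoke.
    return Utterance(text=text, style="normal")

def pipeline(user_audio: Utterance) -> Utterance:
    return text_to_speech(llm_reply(speech_to_text(user_audio)))

reply = pipeline(Utterance(text="can you whisper?", style="whisper"))
print(reply.style)  # the whisper request's style is dropped: prints "normal"
```

A true end-to-end speech model, by contrast, would carry style information through to synthesis, which is why the cascade hypothesis fits the whisper/sing failures reported above.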

Use Cases & Applications

  • Proposed uses: next‑gen voice assistants, call centers (tech support/sales), language learning (especially where good teachers are scarce), role‑playing/DnD, kids’ education, and possibly replacing some actors/voice roles.
  • Some argue most real-world tasks require concise, transactional interactions, not chummy conversation, and find chatty small talk counterproductive.

Social, Ethical, and Emotional Concerns

  • Multiple users reported feeling genuine emotional reactions: guilt when hanging up, attachment after short use, and kids quickly bonding with the agent.
  • Strong worries about:
    • Scam amplification via ultra-realistic voices mimicking relatives.
    • Emotional manipulation, dark patterns, and political/ideological bias.
    • Children and lonely adults forming parasocial relationships with systems that only simulate care.
  • Some argue emotional voices are inherently deceptive and should sound unmistakably robotic; others see emotional nuance as necessary for effective human communication.