My Journey to a reliable and enjoyable locally hosted voice assistant (2025)

Frustrations with Mainstream Voice Assistants

  • Many anecdotes of comically bad behavior: confusion over multiple timers, misinterpreting “stop” and “yes,” sending garbled texts, or doing web searches instead of simple actions.
  • Speech-to-text is seen as “basically solved,” but intent recognition and dialog management are described as “brain-damaged,” especially for simple home-control tasks.
  • Some users report nearly perfect experiences for basic timers and reminders, but others put reliability at around 50%, low enough that they stop using voice altogether.

Do People Actually Like Talking to Assistants?

  • Split views: some dislike speaking aloud or find it slower than using a phone; others consider needing to pull out a phone a “failure” of the smart home.
  • Popular use cases: kitchen timers, lists, simple home controls, driving (navigation/media), and accessibility for motor impairments.
  • Several tolerate assistants only for timers and weather, because those are relatively robust.

Reliability Expectations & LLM Skepticism

  • Debate over “99% reliability”: that level is unacceptable for critical infrastructure; for home automation some consider it good enough, but current LLM-based systems are perceived as falling short of it.
  • Concern about using nondeterministic LLMs for core home automation behavior.
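
The “99%” debate is easier to weigh with rough arithmetic. The sketch below is illustrative only: it assumes the stages of a voice pipeline fail independently, and the stage rates and the figure of 20 commands per day are made-up inputs, not numbers from the discussion.

```python
def pipeline_success(stage_rates):
    """Probability that every stage succeeds, assuming independent stages."""
    p = 1.0
    for rate in stage_rates:
        p *= rate
    return p


def expected_failures(success_rate, commands_per_day, days=30):
    """Expected number of failed commands over a period."""
    return (1.0 - success_rate) * commands_per_day * days


# Hypothetical pipeline: wake word, speech-to-text, intent recognition.
three_stage = pipeline_success([0.99, 0.99, 0.99])
print(round(three_stage, 4))                   # 0.9703
print(round(expected_failures(0.99, 20), 1))   # 6.0 failures per month
print(round(expected_failures(0.50, 20), 1))   # 300.0 failures per month
```

Under these assumptions, three stages at 99% each already compound to roughly 97% end to end, and the gap between a handful of failures a month and several hundred is consistent with why the ~50% group gives up on voice.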

Wake-Word Detection & Activation UX

  • A major pain point is poor wake-word detection, especially on open, Home Assistant-style devices, where it is often worse than on commercial smart speakers.
  • Reports of strong bias in wake-word detection toward adult male voices; children's and women's voices trigger it far less reliably.
  • Ideas explored: custom wake-word training (microWakeWord), alternative triggers (buttons, claps, comm badges, wearables), and “done words” to mark end of speech.
  • Some prefer physical buttons to avoid always-on listening; others argue that defeats hands-free scenarios (cooking, carrying things).
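
The “done word” idea can be sketched as a small loop over transcribed words. This is a minimal illustration, not how any particular assistant implements it: `collect_command`, the choice of done words, and the `max_words` cap are all hypothetical, and the input is assumed to be a stream of words already produced by speech-to-text.

```python
def collect_command(words, done_words=("over", "done"), max_words=30):
    """Collect words until a done word is heard or the cap is reached.

    Returns (command_text, ended_by_done_word).
    """
    collected = []
    for word in words:
        # Normalize case and trailing punctuation from the transcript.
        token = word.strip().lower().strip(".,!?")
        if not token:
            continue
        if token in done_words:
            return " ".join(collected), True
        collected.append(token)
        if len(collected) >= max_words:
            break  # Safety cap so a missed done word cannot record forever.
    return " ".join(collected), False


text, ended = collect_command("Turn on the kitchen lights, over and out".split())
print(text, ended)  # turn on the kitchen lights True
```

One trade-off this makes visible: a done word that is also ordinary vocabulary (such as “over”) will cut commands short, so a real setup would likely need a rarer word or a fallback silence timeout.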

Local Assistants, Home Assistant Devices & Hardware

  • Home Assistant Voice Preview hardware gets mixed reviews: easy setup and good integration, but weak mic/speaker quality, poorer wake-word reliability, awkward turn-taking, and no voice profiles.
  • Locally hosted LLMs are seen as possible but resource-intensive; some offload reasoning to cloud models while keeping STT/TTS local.
  • Audio front-end (beamforming arrays, noise handling, buffering) is seen as at least as critical as the LLM.
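
One way to picture the split between local handling and offloaded reasoning is a router that tries fixed local patterns first and only forwards unmatched text to a remote model. The sketch below is an assumption-laden illustration: the patterns, the `route` function, and the `cloud_fallback` callable are hypothetical and stand in for whatever intent engine and model a given setup uses.

```python
import re

# Hypothetical fixed patterns for common home-control phrases.
LOCAL_PATTERNS = [
    (re.compile(r"^turn (?P<state>on|off) the (?P<target>[\w ]+)$"), "switch"),
    (re.compile(r"^set a timer for (?P<minutes>\d+) minutes?$"), "timer"),
]


def match_local(text):
    """Return a deterministic intent match, or None if nothing matches."""
    normalized = text.strip().lower()
    for pattern, intent in LOCAL_PATTERNS:
        m = pattern.match(normalized)
        if m:
            return {"intent": intent, "slots": m.groupdict(), "source": "local"}
    return None


def route(text, cloud_fallback=None):
    """Prefer the local matcher; call the fallback only for unmatched text."""
    result = match_local(text)
    if result is not None:
        return result
    if cloud_fallback is None:
        return {"intent": "unknown", "slots": {}, "source": "none"}
    reply = cloud_fallback(text)  # e.g. a call out to a hosted model
    return {"intent": "freeform", "slots": {"reply": reply}, "source": "cloud"}


print(route("Turn on the kitchen lights"))
print(route("What should I cook tonight?", cloud_fallback=lambda t: "stub reply"))
```

With this shape, the same phrase always yields the same local action, which speaks to the nondeterminism concern, while open-ended requests can still reach a larger model.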

TTS/ASR Quality and Training Data

  • For local setups, text-to-speech prosody is a major challenge; many current models sound robotic, especially for conversational speech, numbers, and hedged phrases.
  • Some argue most home interactions need only simple chimes for success/failure, not full verbal responses.
  • Automatic speech recognition also struggles with real-world, technical, and noisy environments; fine-tuning on personal data is suggested but data collection is hard.
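
For the chime-only approach, short confirmation tones can be generated with the Python standard library instead of a TTS model. This is a rough sketch: the frequencies, durations, and the `chime_bytes` helper are arbitrary illustrative choices, and actually playing the WAV data on a device is left out.

```python
import io
import math
import struct
import wave


def chime_bytes(freqs, note_ms=120, rate=16000, volume=0.4):
    """Build a mono 16-bit WAV of short sine notes, each with a linear fade-out."""
    frames = bytearray()
    n = rate * note_ms // 1000  # samples per note
    for freq in freqs:
        for i in range(n):
            fade = 1.0 - i / n
            sample = volume * fade * math.sin(2 * math.pi * freq * i / rate)
            frames += struct.pack("<h", int(sample * 32767))
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(bytes(frames))
    return buf.getvalue()


SUCCESS = chime_bytes([660, 880])  # rising pair of notes
FAILURE = chime_bytes([440, 330])  # falling pair of notes
```

A rising pair for success and a falling pair for failure is only one convention; the point is that a two-state acknowledgement needs no speech synthesis at all.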