My Journey to a reliable and enjoyable locally hosted voice assistant (2025)
Frustrations with Mainstream Voice Assistants
- Many anecdotes of comically bad behavior: confusion over multiple timers, misinterpreting “stop” and “yes,” sending garbled texts, or doing web searches instead of simple actions.
- Speech-to-text is seen as “basically solved,” but intent recognition and dialog management are described as “brain-damaged,” especially for simple home-control tasks.
- Some users report near-perfect results for basic timers and reminders, but others put reliability at around 50%, low enough that they stop using voice altogether.
Do People Actually Like Talking to Assistants?
- Split views: some dislike speaking aloud or find it slower than using a phone; others consider needing to pull out a phone a “failure” of the smart home.
- Popular use cases: kitchen timers, lists, simple home controls, driving (navigation/media), and accessibility for motor impairments.
- Several tolerate assistants only for timers and weather, because those features are relatively robust.
Reliability Expectations & LLM Skepticism
- Debate over “99% reliability”: for critical infrastructure that’s unacceptable; for home automation some consider it good enough, but current LLM-based systems are perceived as worse.
- Concern about using nondeterministic LLMs for core home automation behavior.
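To make the "99% reliability" debate above concrete, here is a rough back-of-the-envelope sketch. The usage figure (20 voice commands per day) and the assumption that failures are independent are illustrative guesses, not numbers from the discussion.

```python
def expected_failures(success_rate: float, commands_per_day: int, days: int = 7) -> float:
    """Expected number of failed commands over `days`, assuming independent failures."""
    return (1.0 - success_rate) * commands_per_day * days


def prob_flawless_day(success_rate: float, commands_per_day: int) -> float:
    """Probability that every command in one day succeeds (independence assumed)."""
    return success_rate ** commands_per_day


if __name__ == "__main__":
    # Hypothetical household: 20 voice commands per day.
    for rate in (0.99, 0.50):
        weekly = expected_failures(rate, commands_per_day=20)
        flawless = prob_flawless_day(rate, commands_per_day=20)
        print(f"success={rate:.2f} weekly_failures={weekly:.1f} flawless_day={flawless:.4f}")
```

Under these assumptions, 99% works out to about 1.4 failed commands per week and roughly an 82% chance of a failure-free day, while 50% means about 70 failures per week. That gap may explain why one group calls 99% good enough for a home and the other abandons voice entirely.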
Wake-Word Detection & Activation UX
- A major pain point: poor wake-word detection, especially in open/Home Assistant-style devices; often worse than commercial smart speakers.
- Reports of strong bias in wake-word detection toward adult male voices; children and women trigger far less reliably.
- Ideas explored: custom wake-word training (microWakeWord), alternative triggers (buttons, claps, comm badges, wearables), and “done words” to mark end of speech.
- Some prefer physical buttons to avoid always-on listening; others argue that defeats hands-free scenarios (cooking, carrying things).
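The "done word" idea mentioned above could look something like the following minimal sketch, which works on an already-transcribed utterance. The done words chosen here ("over", "that's all") and the function name are hypothetical; the thread does not specify any particular words or implementation.

```python
import re

# Hypothetical done words; a real setup would pick phrases unlikely to end a command.
DONE_WORDS = ("over", "that's all")


def strip_done_word(transcript: str, done_words=DONE_WORDS) -> tuple[str, bool]:
    """Return (command, finished).

    `finished` is True when the transcript ends with a done word, in which case
    the done word and any separating comma or space are removed from the command.
    """
    text = transcript.strip().rstrip(".!?, ").strip()
    lowered = text.lower()
    for word in done_words:
        # Require a word boundary so "moreover" does not count as "over".
        if re.search(r"(^|\s)" + re.escape(word) + r"$", lowered):
            command = text[: len(text) - len(word)].rstrip(" ,")
            return command, True
    return text, False


if __name__ == "__main__":
    print(strip_done_word("Turn off the kitchen lights, over."))
    print(strip_done_word("Set a timer for ten minutes"))
```

A caller would keep buffering audio while `finished` is False instead of relying on a silence timeout. The obvious trade-off is that any command legitimately ending in a done word gets cut short, so the choice of words matters.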
Local Assistants, Home Assistant Devices & Hardware
- Home Assistant Voice Preview hardware gets mixed reviews: easy setup and good integration, but weak mic/speaker quality, poorer wake-word reliability, awkward turn-taking, and no voice profiles.
- Locally hosted LLMs are seen as possible but resource-intensive; some offload reasoning to cloud models while keeping STT/TTS local.
- Audio front-end (beamforming arrays, noise handling, buffering) is seen as at least as critical as the LLM.
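One possible reading of the hybrid setups described above is to match simple commands locally with fixed patterns and hand everything else to a remote model. The sketch below assumes that arrangement; the intent names, patterns, and the `remote_llm` callable are all hypothetical stand-ins, not Home Assistant's actual API.

```python
import re
from typing import Callable, Optional

# Hypothetical local intents: fixed patterns give repeatable behavior for core commands.
LOCAL_INTENTS = [
    ("light_on", re.compile(r"^turn on the (?P<room>\w+) lights?$")),
    ("light_off", re.compile(r"^turn off the (?P<room>\w+) lights?$")),
    ("set_timer", re.compile(r"^set a timer for (?P<minutes>\d+) minutes?$")),
]


def match_local(text: str) -> Optional[tuple[str, dict]]:
    """Return (intent_name, slots) if a local pattern matches, else None."""
    cleaned = text.lower().strip().rstrip(".!?")
    for name, pattern in LOCAL_INTENTS:
        m = pattern.match(cleaned)
        if m:
            return name, m.groupdict()
    return None


def route(text: str, remote_llm: Callable[[str], str]) -> dict:
    """Handle known commands locally; fall back to the remote model otherwise."""
    local = match_local(text)
    if local is not None:
        name, slots = local
        return {"handler": "local", "intent": name, "slots": slots}
    return {"handler": "remote", "reply": remote_llm(text)}


if __name__ == "__main__":
    stub = lambda prompt: "(remote model reply)"  # placeholder for a cloud call
    print(route("Turn on the kitchen lights.", stub))
    print(route("What should I cook tonight?", stub))
```

Keeping the pattern table small means the commands people rely on most never depend on model sampling or network availability, which speaks to the skepticism about LLMs in core automation. The cost is that phrasing outside the patterns always pays the remote round trip.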
TTS/ASR Quality and Training Data
- For local setups, text-to-speech prosody is a major challenge; many current models sound robotic, especially for conversational speech, numbers, and hedged phrases.
- Some argue most home interactions need only simple chimes for success/failure, not full verbal responses.
- Automatic speech recognition also struggles with real-world, technical, and noisy environments; fine-tuning on personal data is suggested but data collection is hard.