Show HN: Real-time AI Voice Chat at ~500ms Latency

Speech-to-Text and TTS Choices

  • STT: The current setup runs Whisper via faster_whisper/CTranslate2; several commenters note Whisper is still the de facto default, though newer models (e.g., Parakeet) may be better for English-only use and deserve evaluation (a minimal usage sketch follows this list).
  • TTS: Coqui XTTSv2 was chosen for its very low time to first audio (under ~100ms) and its quality; Kokoro and Orpheus are also supported but are slower or lower quality (a streaming sketch also follows below).
  • Some argue newer models like Dia have better voice quality, but the author and others report Dia is too slow, VRAM-hungry, and sometimes unstable for real-time agents.
  • Audio models are reported to be sensitive to quantization; quality degrades noticeably with heavy compression.
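
As a point of reference for the STT side, here is a minimal faster_whisper (CTranslate2 backend) transcription sketch; the model size, device, and audio path are illustrative assumptions, not the project's actual configuration:

```python
# Minimal faster_whisper sketch (CTranslate2 backend). Model size, device,
# and audio path are illustrative assumptions, not the project's config.
from faster_whisper import WhisperModel

model = WhisperModel("base.en", device="cuda", compute_type="float16")

# vad_filter uses Silero VAD internally to skip non-speech audio.
segments, info = model.transcribe("utterance.wav", vad_filter=True, beam_size=1)

for seg in segments:
    print(f"[{seg.start:.2f}s -> {seg.end:.2f}s] {seg.text}")
```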
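And on the TTS side, a sketch of XTTSv2's streaming inference, which is what makes the low time to first audio possible: chunks can be played as they arrive rather than after the whole utterance is synthesized. The checkpoint paths and reference clip are placeholders:

```python
# Sketch of Coqui XTTSv2 streaming inference; checkpoint paths and the
# reference voice clip are placeholder assumptions.
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

config = XttsConfig()
config.load_json("xtts_v2/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="xtts_v2/", eval=True)
model.cuda()

# Voice-cloning conditioning from a short reference recording.
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["reference_voice.wav"]
)

# inference_stream yields audio chunks as they are generated, so playback
# can start after the first chunk instead of after full synthesis.
wav_chunks = []
for chunk in model.inference_stream(
    "Streaming lets playback start almost immediately.",
    "en",
    gpt_cond_latent,
    speaker_embedding,
):
    # Each chunk could be sent straight to the audio device; here we collect.
    wav_chunks.append(chunk)
```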

Latency, Pipeline, and “Real-time”

  • Reported breakdown on an RTX 4090: ~220ms to the first LLM text fragment, ~80ms from TTS start to the first audio chunk, and STT/VAD/turn detection each in the tens of ms, for ~500ms end-to-end (a rough budget tally follows this list).
  • Some see 500ms as “gold standard” for voice agents; audio engineers note this is high by recording standards but acceptable for AI assistants.
  • Others argue Whisper’s architecture isn’t ideal for streaming and that current “real-time” results largely come from throwing high-end GPUs at the problem.
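
Treating those numbers as a budget, a back-of-the-envelope tally; the per-stage values are the thread's reported figures, while the residual attributed to buffering and overhead is an assumption, since the named stages alone sum below ~500ms:

```python
# Rough end-to-end latency budget from figures reported in the thread
# (RTX 4090). The buffering/overhead line is an assumed residual covering
# audio chunking and transport, not a reported measurement.
budget_ms = {
    "stt_vad_turn_detection": 50,   # "tens of ms" combined
    "llm_first_token": 220,
    "tts_first_audio_chunk": 80,
    "buffering_and_overhead": 150,  # assumed residual
}
print(sum(budget_ms.values()), "ms end-to-end (approx.)")  # -> 500 ms
```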

Turn Detection, Interrupts, and Natural Conversation

  • The system combines VAD (Silero) with a fast sentence-completion classifier to decide end of turn, aiming to avoid cutting users off mid-thought (a hypothetical sketch follows this list).
  • Triggering interrupts on raw voice activity initially caused too many false positives; gating interrupts on streaming transcription output instead improved accuracy (see the second sketch after this list).
  • Big thread on “turn-taking”: users want support for long pauses, mid-sentence thinking, active listening (“uh-huh”, “right”), and subtle cues rather than crude silence thresholds.
  • Suggestions include: specialized turn-prediction models, small LLMs estimating “done speaking” probability, streaming re-generation, wake-word models, and eventually unified audio-to-audio models (e.g., Moshi, Sesame-like systems).
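
A hypothetical sketch of the VAD-plus-classifier idea: end of turn is declared only when Silero VAD reports sustained silence and a small classifier judges the running transcript to read as a complete sentence. The thresholds, the `sentence_classifier` callable, and the frame handling are illustrative assumptions, not the project's actual code:

```python
# Hypothetical end-of-turn combiner: Silero VAD for trailing silence plus
# a sentence-completion score over the running transcript. Thresholds and
# the sentence_classifier callable are illustrative assumptions.
import torch

vad_model, _ = torch.hub.load("snakers4/silero-vad", "silero_vad")

SILENCE_PROB = 0.3        # speech probability below this counts as silence
SILENCE_FRAMES = 15       # consecutive silent frames required (assumed)
COMPLETE_THRESHOLD = 0.7  # classifier score treated as "sentence is done"

def is_end_of_turn(frames, transcript, sentence_classifier, sr=16000):
    """True when the user has likely finished their turn.

    frames: recent 512-sample 16kHz audio chunks as torch tensors.
    sentence_classifier: hypothetical callable scoring completeness in [0, 1].
    """
    # 1) Require a run of trailing silence per Silero VAD speech probability.
    trailing_silence = 0
    for frame in reversed(frames):
        if vad_model(frame, sr).item() < SILENCE_PROB:
            trailing_silence += 1
        else:
            break
    if trailing_silence < SILENCE_FRAMES:
        return False
    # 2) Require the transcript to read as a finished sentence, so a
    #    mid-thought pause ("I think we should...") doesn't end the turn.
    return sentence_classifier(transcript) >= COMPLETE_THRESHOLD
```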
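And a hypothetical sketch of the interrupt fix: instead of stopping playback the moment VAD fires (which noise can also trigger), the interrupt only fires once the streaming transcriber emits actual words. All names here are illustrative:

```python
# Hypothetical interrupt gating: raw VAD activity alone caused false
# positives, so the interrupt fires only when streaming STT produces
# real words. Names and the word threshold are illustrative assumptions.

def should_interrupt(vad_active: bool, partial_transcript: str,
                     min_words: int = 2) -> bool:
    """Interrupt TTS only when the user is audibly AND verbally speaking."""
    if not vad_active:
        return False
    # Noise raises VAD activity but yields no (or trivial) transcription,
    # so gate the interrupt on the streaming transcript instead.
    return len(partial_transcript.split()) >= min_words

# e.g., in the playback loop (hypothetical objects):
#   if should_interrupt(vad.is_speech, stt.partial_text):
#       tts_player.stop()
```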

Voices, Persona, and UX

  • Default custom “Lasinya” voice and “girlfriend” persona are polarizing: some praise responsiveness; others find the style/affect off-putting or bordering on mimicry of specific dialects.
  • Users want: shorter, less sycophantic replies; configurable voices; bilingual TTS; and SSML-style prosody control, e.g., rising pitch on questions (a markup sketch follows).
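
For a concrete picture of the prosody request, a sketch of the kind of SSML markup commenters seem to have in mind; this is an assumption about the ask, not something XTTSv2 consumes, though engines such as Amazon Polly or Azure TTS accept markup like this:

```python
# Illustrative SSML of the kind commenters request; XTTSv2 does not parse
# SSML, but engines like Amazon Polly or Azure TTS accept markup like this.
ssml = """
<speak>
  <s>The deployment finished.</s>
  <s><prosody pitch="+15%" rate="95%">Did the tests pass?</prosody></s>
</speak>
"""
```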

Hardware, Platforms, and Installation Friction

  • The current setup assumes a strong NVIDIA GPU (e.g., 24GB of VRAM to host a 24B-parameter LLM). AMD users report pain; there are scattered references to AMD/Vulkan workarounds and alternative frameworks.
  • Raspberry Pi and typical VPSs are seen as too weak to run the full stack in real time.
  • Many comments vent about Python/CUDA dependency hell (especially on Windows), with calls for better packaging (conda/uv, Docker) and explicit environment support.