Show HN: Real-time AI Voice Chat at ~500ms Latency

Speech-to-Text and TTS Choices

  • STT: The current setup runs Whisper via faster_whisper/CTranslate2; several commenters note Whisper is still the de facto default, though newer models (e.g., Parakeet) may be better for English-only use and deserve evaluation (a minimal usage sketch follows this list).
  • TTS: Coqui XTTSv2 was chosen for its very low time to first audio (under ~100ms) and its quality; Kokoro and Orpheus are also supported but are slower or lower quality (a streaming sketch also follows below).
  • Some argue newer models like Dia have better voice quality, but the author and others report Dia is too slow, VRAM-hungry, and sometimes unstable for real-time agents.
  • Audio models are reported to be sensitive to quantization; quality degrades noticeably with heavy compression.
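
As a point of reference for the STT side, here is a minimal faster_whisper (CTranslate2 backend) transcription sketch; the model size, device, and audio path are illustrative assumptions, not the project's actual configuration:

```python
# Minimal faster_whisper sketch (CTranslate2 backend). Model size, device,
# and audio path are illustrative assumptions, not the project's config.
from faster_whisper import WhisperModel

model = WhisperModel("base.en", device="cuda", compute_type="float16")

# vad_filter uses Silero VAD internally to skip non-speech audio.
segments, info = model.transcribe("utterance.wav", vad_filter=True, beam_size=1)

for seg in segments:
    print(f"[{seg.start:.2f}s -> {seg.end:.2f}s] {seg.text}")
```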
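And on the TTS side, a sketch of XTTSv2's streaming inference, which is what makes the low time to first audio possible: chunks can be played as they arrive rather than after the whole utterance is synthesized. The checkpoint paths and reference clip are placeholders:

```python
# Sketch of Coqui XTTSv2 streaming inference; checkpoint paths and the
# reference voice clip are placeholder assumptions.
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

config = XttsConfig()
config.load_json("xtts_v2/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="xtts_v2/", eval=True)
model.cuda()

# Voice-cloning conditioning from a short reference recording.
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["reference_voice.wav"]
)

# inference_stream yields audio chunks as they are generated, so playback
# can start after the first chunk instead of after full synthesis.
wav_chunks = []
for chunk in model.inference_stream(
    "Streaming lets playback start almost immediately.",
    "en",
    gpt_cond_latent,
    speaker_embedding,
):
    # Each chunk could be sent straight to the audio device; here we collect.
    wav_chunks.append(chunk)
```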

Latency, Pipeline, and “Real-time”

  • Reported breakdown on an RTX 4090: ~220ms to the first LLM text fragment, ~80ms from TTS start to the first audio chunk, and STT/VAD/turn detection each in the tens of ms, for ~500ms end-to-end (a rough budget tally follows this list).
  • Some see 500ms as “gold standard” for voice agents; audio engineers note this is high by recording standards but acceptable for AI assistants.
  • Others argue Whisper’s architecture isn’t ideal for streaming and that current “real-time” results largely come from throwing high-end GPUs at the problem.
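
Treating those numbers as a budget, a back-of-the-envelope tally; the per-stage values are the thread's reported figures, while the residual attributed to buffering and overhead is an assumption, since the named stages alone sum below ~500ms:

```python
# Rough end-to-end latency budget from figures reported in the thread
# (RTX 4090). The buffering/overhead line is an assumed residual covering
# audio chunking and transport, not a reported measurement.
budget_ms = {
    "stt_vad_turn_detection": 50,   # "tens of ms" combined
    "llm_first_token": 220,
    "tts_first_audio_chunk": 80,
    "buffering_and_overhead": 150,  # assumed residual
}
print(sum(budget_ms.values()), "ms end-to-end (approx.)")  # -> 500 ms
```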

Turn Detection, Interrupts, and Natural Conversation

  • The system combines VAD (Silero) with a fast sentence-completion classifier to decide end of turn, aiming to avoid cutting users off mid-thought (a hypothetical sketch follows this list).
  • Triggering interrupts on raw voice activity initially caused too many false positives; gating interrupts on streaming transcription output instead improved accuracy (see the second sketch after this list).
  • Big thread on “turn-taking”: users want support for long pauses, mid-sentence thinking, active listening (“uh-huh”, “right”), and subtle cues rather than crude silence thresholds.
  • Suggestions include: specialized turn-prediction models, small LLMs estimating “done speaking” probability, streaming re-generation, wake-word models, and eventually unified audio-to-audio models (e.g., Moshi, Sesame-like systems).
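
A hypothetical sketch of the VAD-plus-classifier idea: end of turn is declared only when Silero VAD reports sustained silence and a small classifier judges the running transcript to read as a complete sentence. The thresholds, the `sentence_classifier` callable, and the frame handling are illustrative assumptions, not the project's actual code:

```python
# Hypothetical end-of-turn combiner: Silero VAD for trailing silence plus
# a sentence-completion score over the running transcript. Thresholds and
# the sentence_classifier callable are illustrative assumptions.
import torch

vad_model, _ = torch.hub.load("snakers4/silero-vad", "silero_vad")

SILENCE_PROB = 0.3        # speech probability below this counts as silence
SILENCE_FRAMES = 15       # consecutive silent frames required (assumed)
COMPLETE_THRESHOLD = 0.7  # classifier score treated as "sentence is done"

def is_end_of_turn(frames, transcript, sentence_classifier, sr=16000):
    """True when the user has likely finished their turn.

    frames: recent 512-sample 16kHz audio chunks as torch tensors.
    sentence_classifier: hypothetical callable scoring completeness in [0, 1].
    """
    # 1) Require a run of trailing silence per Silero VAD speech probability.
    trailing_silence = 0
    for frame in reversed(frames):
        if vad_model(frame, sr).item() < SILENCE_PROB:
            trailing_silence += 1
        else:
            break
    if trailing_silence < SILENCE_FRAMES:
        return False
    # 2) Require the transcript to read as a finished sentence, so a
    #    mid-thought pause ("I think we should...") doesn't end the turn.
    return sentence_classifier(transcript) >= COMPLETE_THRESHOLD
```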
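And a hypothetical sketch of the interrupt fix: instead of stopping playback the moment VAD fires (which noise can also trigger), the interrupt only fires once the streaming transcriber emits actual words. All names here are illustrative:

```python
# Hypothetical interrupt gating: raw VAD activity alone caused false
# positives, so the interrupt fires only when streaming STT produces
# real words. Names and the word threshold are illustrative assumptions.

def should_interrupt(vad_active: bool, partial_transcript: str,
                     min_words: int = 2) -> bool:
    """Interrupt TTS only when the user is audibly AND verbally speaking."""
    if not vad_active:
        return False
    # Noise raises VAD activity but yields no (or trivial) transcription,
    # so gate the interrupt on the streaming transcript instead.
    return len(partial_transcript.split()) >= min_words

# e.g., in the playback loop (hypothetical objects):
#   if should_interrupt(vad.is_speech, stt.partial_text):
#       tts_player.stop()
```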

Voices, Persona, and UX

  • Default custom “Lasinya” voice and “girlfriend” persona are polarizing: some praise responsiveness; others find the style/affect off-putting or bordering on mimicry of specific dialects.
  • Users want: shorter, less sycophantic replies; configurable voices; bilingual TTS; and SSML-style prosody control, e.g., rising pitch on questions (a markup sketch follows).
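
For a concrete picture of the prosody request, a sketch of the kind of SSML markup commenters seem to have in mind; this is an assumption about the ask, not something XTTSv2 consumes, though engines such as Amazon Polly or Azure TTS accept markup like this:

```python
# Illustrative SSML of the kind commenters request; XTTSv2 does not parse
# SSML, but engines like Amazon Polly or Azure TTS accept markup like this.
ssml = """
<speak>
  <s>The deployment finished.</s>
  <s><prosody pitch="+15%" rate="95%">Did the tests pass?</prosody></s>
</speak>
"""
```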

Hardware, Platforms, and Installation Friction

  • The current setup assumes a strong NVIDIA GPU (e.g., 24GB of VRAM to host a 24B-parameter LLM). AMD users report pain; there are scattered references to AMD/Vulkan workarounds and alternative frameworks.
  • Raspberry Pi and typical VPSs are seen as too weak to run the full stack in real time.
  • Many comments vent about Python/CUDA dependency hell (especially on Windows), with calls for better packaging (conda/uv, Docker) and explicit environment support.