Show HN: I built a sub-500ms latency voice agent from scratch

Overall reaction & significance

  • Many find the write-up unusually clear and useful, especially the latency breakdown and the visual explanation of the loop.
  • Sub‑500ms is seen as an important UX threshold where voice agents start to feel conversational rather than IVR-like.
  • Several commenters built similar systems and confirm the framing: voice agents are primarily an orchestration and turn‑taking problem, not a pure model problem.

Latency, TTFT, and orchestration

  • Latency is described as distributed across the pipeline: network, STT, LLM, TTS, and telephony hops.
  • Time‑to‑first‑token (TTFT) is repeatedly called out as more important than total generation time: streaming an early acknowledgment feels faster than delivering the complete answer slightly sooner.
  • Co‑location of services, warm WebSocket connections, and caching (e.g., common TTS phrases) are suggested as key optimizations.
  • Geography matters: when callers are far from the infra region (e.g., India → US‑East), carrier and edge routing add noticeable delay.
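The "latency is distributed across the pipeline" point can be made concrete with a per-stage budget. The stage names and numbers below are illustrative assumptions, not figures from the thread; the idea is simply that the critical path is a sum of hops, and each one must be accounted for to stay under the sub-500ms threshold.

```python
# Hypothetical per-stage latency budget (milliseconds) for a cascaded
# voice agent. Numbers are illustrative, not measurements.
BUDGET_MS = {
    "telephony_ingress": 40,   # carrier + media-server hop
    "stt_stable_partial": 120, # streaming STT emits a stable transcript
    "llm_ttft": 180,           # time to first token from the LLM
    "tts_first_audio": 100,    # time to first synthesized audio chunk
    "network_egress": 40,      # audio back out to the caller
}

def total_latency(budget: dict[str, int]) -> int:
    """Sum the per-stage latencies on the critical path."""
    return sum(budget.values())

def within_target(budget: dict[str, int], target_ms: int = 500) -> bool:
    """Check the end-to-end budget against the conversational threshold."""
    return total_latency(budget) <= target_ms

if __name__ == "__main__":
    print(total_latency(BUDGET_MS), within_target(BUDGET_MS))
```

Note how the LLM's TTFT dominates this (hypothetical) budget, which is why the thread focuses on streaming and co-location rather than raw generation speed.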

VAD, endpoint detection, and turn-taking

  • Multiple approaches are discussed: classical VAD, semantic “end of turn” based on text, and fused models that use both audio and semantics.
  • Some argue semantic endpoint detection or integrated turn‑taking models outperform raw VAD and fixed silence thresholds.
  • Handling “barge‑in” (user interrupting mid‑response) is highlighted as a complex area, especially when downstream actions (bookings, webhooks, DB writes) may already be in flight.
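The "fused" audio-plus-semantics idea can be sketched as an adaptive silence threshold: wait less after text that looks syntactically complete, longer after a trailing connector or filler word. The heuristics below are hypothetical stand-ins; production systems discussed in the thread use learned end-of-turn models instead of hand-written rules.

```python
import re

def silence_threshold_ms(transcript: str) -> int:
    """
    Sketch of semantic endpoint detection: adapt the silence threshold
    to how 'finished' the partial transcript looks. The word lists and
    thresholds here are illustrative, not tuned values.
    """
    text = transcript.strip().lower()
    if not text:
        return 1200                      # no speech yet: be patient
    if re.search(r"[?.!]$", text):
        return 300                       # likely a complete utterance
    if re.search(r"\b(and|but|so|um|uh|the|to)$", text):
        return 900                       # trailing connector/filler: keep waiting
    return 600                           # default middle ground

def should_end_turn(transcript: str, silence_ms: int) -> bool:
    """Fuse the audio signal (measured silence) with the semantic rule."""
    return silence_ms >= silence_threshold_ms(transcript)
```

The same structure explains why a fixed silence threshold underperforms: 700 ms of silence after "what's my balance?" should end the turn, but the same 700 ms after "I want to um" should not.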

Cascaded STT→LLM→TTS vs end‑to‑end speech models

  • Some call cascaded pipelines a “dead end” and see end‑to‑end speech‑to‑speech (full‑duplex models, Moshi‑style) as the future.
  • Others with production experience argue cascades will persist because they’re auditable, modular, and easier to regulate and debug, especially for enterprises.
  • Concern is raised that end‑to‑end models can inflate KV‑cache size and latency, especially on device.
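The auditability argument for cascades can be shown structurally: in a cascaded pipeline every intermediate artifact is text or audio that can be logged and inspected, whereas an end-to-end model offers no such seam. The sketch below uses stub stages (the lambdas are placeholders, not real STT/LLM/TTS clients) to show where those audit points live.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class CascadedTurn:
    """
    One turn of a cascaded STT -> LLM -> TTS pipeline. Each stage is an
    injected callable, so stages are swappable (modularity), and every
    intermediate artifact is recorded in `trace` (auditability): what
    was heard, what was decided, and what was said back.
    """
    stt: Callable[[bytes], str]
    llm: Callable[[str], str]
    tts: Callable[[str], bytes]
    trace: list[tuple[str, object]] = field(default_factory=list)

    def run(self, audio_in: bytes) -> bytes:
        transcript = self.stt(audio_in)
        self.trace.append(("stt", transcript))
        reply = self.llm(transcript)
        self.trace.append(("llm", reply))
        audio_out = self.tts(reply)
        self.trace.append(("tts", len(audio_out)))
        return audio_out

# Stub stages for demonstration; real ones would be streaming clients.
turn = CascadedTurn(
    stt=lambda audio: "what time is it",
    llm=lambda text: f"You asked: {text}",
    tts=lambda text: text.encode("utf-8"),
)
```

This seam-per-stage structure is what the production-experience commenters mean by "easier to regulate and debug": each boundary is a place to log, filter, or swap vendors.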

UX, human timing, and filler speech

  • Several comments connect to conversation analysis: human turn gaps are often ~0 ms or even negative (listeners predict the end of a turn and start early), and humans rely on nonverbal cues, backchannels, and fillers.
  • Ideas: use small models or pre‑cached audio clips for “thinking” fillers (“hmm… let me check”), or even predictive response generation while the user is still speaking.
  • Others warn of “uncanny valley” risk if fillers are mistimed and of frustration when assistants interrupt during brief pauses.
  • Some suggest explicit verbal end markers (“over to you”) or keyword‑based turn endings to avoid mis‑detections, though others find this unnatural.
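The pre-cached filler idea can be sketched as a race against TTFT: if the LLM's first token has not arrived within a budget, play a cached clip to hold the floor, then play the real answer. Everything here is a stand-in (the byte strings represent pre-synthesized audio, and "playing" just collects output); the 400 ms budget is an assumed value, not one from the thread.

```python
import asyncio

# Pre-synthesized filler clips, keyed by phrase (stubbed as bytes).
CACHED_FILLERS = {
    "hmm, let me check": b"<audio:hmm>",
    "one moment": b"<audio:one-moment>",
}

async def respond_with_filler(llm_task: asyncio.Task, filler_after_ms: int = 400):
    """
    Wait up to `filler_after_ms` for the LLM's first output; if it is
    slower, play a cached filler to cover the gap, then the answer.
    asyncio.shield keeps the timeout from cancelling the LLM call.
    """
    played = []
    try:
        first = await asyncio.wait_for(asyncio.shield(llm_task),
                                       filler_after_ms / 1000)
    except asyncio.TimeoutError:
        played.append(CACHED_FILLERS["hmm, let me check"])  # hold the floor
        first = await llm_task                              # then get the answer
    played.append(first.encode("utf-8"))  # stand-in for TTS of the answer
    return played

async def demo():
    async def slow_llm():
        await asyncio.sleep(0.6)          # TTFT slower than the 400 ms budget
        return "Your balance is $42."
    task = asyncio.create_task(slow_llm())
    return await respond_with_filler(task)
```

The mistiming risk the thread warns about is visible here too: if the budget is set too low, the agent plays a filler during pauses where none was needed, which is exactly the "uncanny valley" failure mode.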

Frameworks, tooling, and alternatives

  • Commenters compare hand‑rolled Python orchestration to frameworks like LiveKit, Pipecat, and commercial STT/endpoint providers (Deepgram Flux, Soniox).
  • Some advocate building from scratch once to truly understand latency sources; others favor production‑grade frameworks for robustness and configurability.
  • Several share alternative stacks: Twilio, various STT/TTS vendors, Groq/Cerebras/Claude/Gemini, local Qwen/MLX pipelines, and fully in‑browser or offline agents.
  • Costs, reliability, and rate‑limit issues with some hosted LLMs are mentioned as practical constraints.

Big‑tech assistants and business constraints

  • There’s frustration that consumer assistants (Alexa/Siri/Google) feel dated compared to bespoke setups.
  • Possible reasons cited: GPU cost at scale, the need for very strong guardrails when controlling real‑world devices, weak monetization for simple voice queries, and the complexity of upgrading legacy assistants into fully agentic systems.