OpenAI’s WebRTC problem

WebRTC complexity and design trade‑offs

  • Many commenters find WebRTC painful: a huge protocol surface (SDP, ICE, STUN/TURN, RTP, many RFCs), tricky handshakes, and over‑engineering for an internet that “doesn’t exist” (heavy P2P/NAT assumptions).
  • Others argue the complexity reflects real‑time media reality: NAT traversal, jitter buffers, congestion control, codecs, encryption, echo cancellation, and browser ubiquity are hard problems WebRTC already solves.
  • There’s frustration that even simple client–server scenarios must still go through the full P2P-style signaling dance.
  • Some note that WebRTC exposes tuning knobs (forward error correction, jitter buffer parameters, packet loss concealment, etc.) and that many glitches in OpenAI’s voice experience seem more like inference/model issues than transport problems.
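
To make the "jitter buffer" item concrete, here is a minimal sketch of the idea in Python. It is not how any WebRTC implementation does it: the class name, the `depth` parameter, and the packet layout are all illustrative assumptions. It holds back a few packets, releases them in sequence order, and reports a gap (which a real stack would hide with packet loss concealment) when a packet misses its slot.

```python
import heapq
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(order=True)
class Packet:
    seq: int                              # sender-assigned sequence number
    payload: bytes = field(compare=False)  # audio bytes, ignored when ordering


class JitterBuffer:
    """Toy jitter buffer (hypothetical): reorder packets, release them in order.

    depth is how many packets are held before playout starts. A larger depth
    tolerates more reordering and delay variation but adds latency.
    """

    def __init__(self, depth: int = 3) -> None:
        self.depth = depth
        self.heap: List[Packet] = []
        self.next_seq: Optional[int] = None  # None until playout has started

    def push(self, pkt: Packet) -> None:
        # A packet whose playout slot has already passed is useless: drop it.
        if self.next_seq is not None and pkt.seq < self.next_seq:
            return
        heapq.heappush(self.heap, pkt)

    def pop(self) -> Optional[bytes]:
        """Return the next payload, or None while filling or when there is a gap."""
        if self.next_seq is None:
            if len(self.heap) < self.depth:
                return None                   # still filling the initial buffer
            self.next_seq = self.heap[0].seq  # start at the oldest packet held
        # Skip duplicates of packets that were already played.
        while self.heap and self.heap[0].seq < self.next_seq:
            heapq.heappop(self.heap)
        expected = self.next_seq
        self.next_seq += 1                    # the playout clock always advances
        if self.heap and self.heap[0].seq == expected:
            return heapq.heappop(self.heap).payload
        return None                           # lost or late: caller conceals the gap


jb = JitterBuffer(depth=2)
for seq in [1, 0, 3]:                         # out of order, and packet 2 is lost
    jb.push(Packet(seq, bytes([seq])))
played = [jb.pop() for _ in range(4)]
```

In this usage, `played` contains the payloads for packets 0 and 1, then `None` for the missing packet 2, then packet 3. The trade-off the commenters describe is visible in `depth`: raising it hides more network disorder but delays every frame.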

Latency vs accuracy in voice AI

  • One camp: for voice AI, it’s more important not to drop audio than to be perfectly real time. LLMs can handle gaps; users might accept ~200–500 ms extra delay for correctness.
  • Opposing camp: in human–AI voice conversations every millisecond counts. An extra 200–500 ms on top of model and tool latencies makes interactions feel broken, increases interruptions, and kills engagement.
  • Several people running large voice systems report that going from ~1.2–1.5 s to ~700 ms turn latency dramatically improved usability; they are spending heavily to shave off another ~100–200 ms.
  • Some call the “latency vs reliability” framing a false dichotomy: TCP/WebSockets can stream audio immediately while still retransmitting losses, so the main challenge moves to jitter buffering (which is needed only where audio is played back to a human, not where it is fed to the model).
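
The disagreement above largely comes down to what one lost packet costs on a reliable, in-order transport. The following sketch models that under loudly simplified assumptions: frames are sent every `frame_ms`, the network adds a fixed `one_way_ms`, and a lost frame shows up `retransmit_ms` later. All three values and both function names are hypothetical, chosen only for illustration.

```python
from typing import List, Set


def reliable_delivery_times(
    n_frames: int,
    lost: Set[int],
    frame_ms: int = 20,        # assumed audio frame duration
    one_way_ms: int = 40,      # assumed fixed network delay
    retransmit_ms: int = 120,  # assumed extra delay for a retransmitted frame
) -> List[int]:
    """Time (ms) at which each frame reaches the application, in order."""
    delivered = []
    latest = 0
    for i in range(n_frames):
        arrival = i * frame_ms + one_way_ms
        if i in lost:
            arrival += retransmit_ms
        # In-order delivery: a frame cannot be handed over before earlier ones.
        latest = max(latest, arrival)
        delivered.append(latest)
    return delivered


def extra_buffer_ms(delivered: List[int], frame_ms: int = 20, one_way_ms: int = 40) -> int:
    """Playout delay needed, beyond the loss-free case, to play without gaps."""
    return max(t - (i * frame_ms + one_way_ms) for i, t in enumerate(delivered))


times = reliable_delivery_times(10, lost={3})
stalled = sum(1 for i, t in enumerate(times) if t > i * 20 + 40)
needed = extra_buffer_ms(times)
```

With these assumed numbers, the single loss of frame 3 delays six frames (3 through 8) and the player needs 120 ms of extra buffering to avoid an audible stall. Nothing is dropped, which is the first camp's point; the cost is paid as playout delay on the human-facing side, which is the second camp's concern.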

Alternatives: WebTransport, WebSockets, QUIC, MoQ

  • WebTransport + QUIC, RTP-over-QUIC, HTTP/2, and MoQ (Media over QUIC) are discussed as ways to get WebRTC-like behavior with more control and without the P2P baggage.
  • Downsides: WebTransport server setup is non‑trivial and support is still maturing; MoQ is promising but niche; rolling your own over WebSockets forces you to re‑implement jitter buffering, congestion control, acoustic echo cancellation (AEC), etc.
  • Some see a future in QUIC-based media stacks; others doubt MoQ’s alignment with end‑to‑end problems like echo cancellation and device DSP.
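
As a rough picture of what "rolling your own" starts with, the sketch below frames audio chunks for a reliable byte stream such as a TCP connection or a single QUIC stream. The 10-byte header (sequence number, capture timestamp, payload length) is an invented layout, not part of any standard. WebSocket binary messages already preserve boundaries, so the length field would be redundant there, but the sequence number and timestamp are still what a receiver-side jitter buffer would need.

```python
import struct
from typing import List, Tuple

# Hypothetical header: uint32 sequence, uint32 capture time (ms), uint16 length.
HEADER = struct.Struct(">IIH")

Frame = Tuple[int, int, bytes]


def encode_frame(seq: int, timestamp_ms: int, payload: bytes) -> bytes:
    """Prefix one audio chunk with the header so it can be split out later."""
    if len(payload) > 0xFFFF:
        raise ValueError("payload too large for a 16-bit length field")
    return HEADER.pack(seq, timestamp_ms, len(payload)) + payload


def decode_frames(buffer: bytes) -> Tuple[List[Frame], bytes]:
    """Parse every complete frame in buffer; return them plus leftover bytes."""
    frames: List[Frame] = []
    offset = 0
    while len(buffer) - offset >= HEADER.size:
        seq, timestamp_ms, length = HEADER.unpack_from(buffer, offset)
        end = offset + HEADER.size + length
        if end > len(buffer):
            break  # payload not fully received yet; wait for more bytes
        frames.append((seq, timestamp_ms, buffer[offset + HEADER.size:end]))
        offset = end
    return frames, buffer[offset:]


stream = encode_frame(0, 0, b"abc") + encode_frame(1, 20, b"de")
first, rest = decode_frames(stream[:15])           # second frame arrives split
second, leftover = decode_frames(rest + stream[15:])
```

Framing is the easy part. The jitter buffering, congestion control, and echo cancellation that commenters list as the real cost all sit on top of something like this.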

Real‑world reports and UX observations

  • Some practitioners report good results with WebRTC-based stacks (e.g., LiveKit, Daily, mesh/managed clouds, Pipecat) for large‑scale voice agents, despite setup pain.
  • Others find non‑WebRTC systems (e.g., persistent HTTP/2 connections similar to smart speakers) adequate and simpler for server‑centric voice AI.
  • Several note that users strongly feel latency in voice; some argue users implicitly trade accuracy for responsiveness, while others insist correctness must remain primary, especially for high‑stakes scenarios.
  • There is side discussion about IPv6 potentially simplifying routing, and about browser APIs (including WebRTC) having many undocumented edge cases and weak timestamp semantics.