OpenAI’s WebRTC problem
WebRTC complexity and design trade‑offs
- Many commenters find WebRTC painful: a huge protocol surface (SDP, ICE, STUN/TURN, RTP, many RFCs), tricky handshakes, and a design they see as over‑engineered for an internet that “doesn’t exist” (it leans heavily on P2P and NAT‑traversal assumptions).
- Others argue the complexity reflects real‑time media reality: NAT traversal, jitter buffers, congestion control, codecs, encryption, echo cancellation, and browser ubiquity are hard problems WebRTC already solves.
- There’s frustration that even simple client–server scenarios must still go through the full P2P-style signaling dance.
- Some note that WebRTC has tuning knobs (FEC, jitter buffer params, PLC, etc.) and that many glitches in OpenAI’s voice seem more like inference/model issues than transport.
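To make one of the knobs mentioned above concrete: Opus in‑band FEC is advertised through an `a=fmtp` line in the SDP (the `useinbandfec` parameter from the Opus RTP payload format). The sketch below pulls the Opus parameters out of an SDP blob. The SDP fragment is trimmed and made up for illustration, not a complete offer, and the helper name `opus_fmtp_params` is hypothetical.

```python
# Trimmed, made-up SDP fragment for illustration only (not a complete offer).
EXAMPLE_SDP = """
v=0
o=- 4611731400430051336 2 IN IP4 127.0.0.1
s=-
t=0 0
m=audio 9 UDP/TLS/RTP/SAVPF 111
c=IN IP4 0.0.0.0
a=mid:0
a=rtcp-mux
a=rtpmap:111 opus/48000/2
a=fmtp:111 minptime=10;useinbandfec=1
"""


def opus_fmtp_params(sdp):
    """Return the fmtp parameters of the first Opus payload type as a dict."""
    lines = [ln.strip() for ln in sdp.splitlines() if ln.strip()]

    # Step 1: find the payload type number that maps to Opus.
    payload_type = None
    for ln in lines:
        if ln.startswith("a=rtpmap:"):
            pt, _, codec = ln[len("a=rtpmap:"):].partition(" ")
            if codec.lower().startswith("opus/"):
                payload_type = pt
                break
    if payload_type is None:
        return {}

    # Step 2: parse the "key=value;key=value" list on the matching fmtp line.
    params = {}
    prefix = f"a=fmtp:{payload_type} "
    for ln in lines:
        if ln.startswith(prefix):
            for item in ln[len(prefix):].split(";"):
                key, _, value = item.strip().partition("=")
                if key:
                    params[key] = value
    return params


params = opus_fmtp_params(EXAMPLE_SDP)
print(params)                              # {'minptime': '10', 'useinbandfec': '1'}
print(params.get("useinbandfec") == "1")   # True: in-band FEC is advertised
```

Even this small lookup has to cross-reference two attribute lines by payload type, which gives a feel for why commenters describe SDP as a large surface to get right.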
Latency vs accuracy in voice AI
- One camp: for voice AI, not dropping audio matters more than being perfectly real time. LLMs can tolerate gaps in when audio arrives, and users might accept ~200–500 ms of extra delay in exchange for correctness.
- Opposing camp: in human–AI voice conversations every millisecond counts. An extra 200–500 ms on top of model and tool latencies makes interactions feel broken, increases interruptions, and kills engagement.
- Several people running large voice systems report that going from ~1.2–1.5 s to ~700 ms turn latency dramatically improved usability; they are spending heavily to shave off another ~100–200 ms.
- Some call the “latency vs reliability” framing a false dichotomy: TCP/WebSockets can stream audio immediately while still retransmitting losses, with the main challenge moving to jitter buffering (needed only on the human‑output side).
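The trade‑off debated above can be illustrated with a toy fixed‑delay jitter buffer: each frame gets a playout deadline, and a larger playout delay means fewer frames miss it. This is a minimal sketch under simplifying assumptions: 20 ms frames, a schedule anchored on the first frame's arrival, and a made‑up arrival trace. Real jitter buffers adapt their delay over time; the names here are hypothetical.

```python
from dataclasses import dataclass

FRAME_MS = 20  # assumed audio frame duration


@dataclass(frozen=True)
class Frame:
    seq: int           # sequence number assigned by the sender
    arrival_ms: float  # when the frame reached the receiver


def count_late_frames(frames, playout_delay_ms, frame_ms=FRAME_MS):
    """Count frames that arrive after their scheduled playout time.

    Frame k is scheduled at: first_arrival + playout_delay + k * frame_ms,
    where k is the offset from the first sequence number.
    """
    if not frames:
        return 0
    ordered = sorted(frames, key=lambda f: f.seq)
    first = ordered[0]
    base = first.arrival_ms + playout_delay_ms
    late = 0
    for f in ordered:
        deadline = base + (f.seq - first.seq) * frame_ms
        if f.arrival_ms > deadline:
            late += 1
    return late


# Made-up trace: frame i is sent at i * 20 ms and delayed by delays[i] ms.
delays = [30, 30, 90, 35, 30, 150, 40, 30]
frames = [Frame(i, i * FRAME_MS + d) for i, d in enumerate(delays)]

for delay in (0, 60, 120):
    print(delay, count_late_frames(frames, delay))
# 0 4
# 60 1
# 120 0
```

Over a reliable transport such as TCP or WebSockets, a "late" frame still arrives and can be forwarded to the model; it only matters for playback to a human, which is the point made about buffering being needed on the human‑output side.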
Alternatives: WebTransport, WebSockets, QUIC, MoQ
- WebTransport + QUIC, RTP-over-QUIC, HTTP/2, and MoQ are discussed as ways to get WebRTC-like behavior with more control and without the P2P baggage.
- Downsides: WebTransport server setup is non‑trivial and support is still maturing; MoQ is promising but niche; rolling your own over WebSockets forces you to re‑implement jitter buffering, congestion control, echo cancellation (AEC), etc.
- Some see a future in QUIC-based media stacks; others doubt MoQ’s alignment with end‑to‑end problems like echo cancellation and device DSP.
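As a rough idea of what "rolling your own" starts with, the sketch below frames audio chunks for a reliable message transport (for example, binary WebSocket messages) by prefixing each chunk with a sequence number and a capture timestamp, so the receiver can do its own ordering and jitter buffering. The header layout and function names are assumptions made for illustration, not an existing protocol.

```python
import struct

# Hypothetical 12-byte header, network byte order:
#   seq           uint32  frame counter
#   capture_ts_us uint64  capture time in microseconds
HEADER = struct.Struct("!IQ")


def encode_frame(seq, capture_ts_us, payload):
    """Prefix an encoded audio chunk with the header."""
    return HEADER.pack(seq, capture_ts_us) + payload


def decode_frame(message):
    """Split a received message into (seq, capture_ts_us, payload)."""
    if len(message) < HEADER.size:
        raise ValueError("message shorter than header")
    seq, capture_ts_us = HEADER.unpack_from(message)
    return seq, capture_ts_us, message[HEADER.size:]


message = encode_frame(7, 1_234_567, b"\x01\x02\x03")
print(len(message))           # 15 (12-byte header + 3-byte payload)
print(decode_frame(message))  # (7, 1234567, b'\x01\x02\x03')
```

Everything else the bullet lists (congestion control, echo cancellation, device DSP) still has to be built or borrowed on top of a framing layer like this.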
Real‑world reports and UX observations
- Some practitioners report good results with WebRTC-based stacks (e.g., LiveKit, Daily, Pipecat, and mesh or managed cloud offerings) for large‑scale voice agents, despite setup pain.
- Others find non‑WebRTC systems (e.g., persistent HTTP/2 connections similar to smart speakers) adequate and simpler for server‑centric voice AI.
- Several note that users strongly feel latency in voice; some argue users implicitly trade accuracy for responsiveness, while others insist correctness must remain primary, especially for high‑stakes scenarios.
- There is a side discussion about IPv6 potentially simplifying routing, and about browser APIs (including WebRTC) having many undocumented edge cases and weak timestamp semantics.