2026-05-07

OpenAI’s WebRTC problem

WebRTC complexity and design trade‑offs

Many commenters find WebRTC painful: it has a huge protocol surface (SDP, ICE, STUN/TURN, RTP, and many RFCs), tricky handshakes, and a design they see as over‑engineered for an internet that “doesn’t exist” (heavy P2P/NAT assumptions).
Others argue the complexity reflects real‑time media reality: NAT traversal, jitter buffers, congestion control, codecs, encryption, echo cancellation, and browser ubiquity are hard problems WebRTC already solves.
There’s frustration that even simple client–server scenarios must still go through the full P2P-style signaling dance.
Some note that WebRTC exposes tuning knobs, such as forward error correction (FEC), jitter buffer parameters, and packet loss concealment (PLC), and that many glitches in OpenAI’s voice seem more like inference/model issues than transport problems.
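To make the jitter buffer and loss concealment knobs mentioned above concrete, here is a minimal sketch of a fixed-depth jitter buffer. It is an illustration only, not how WebRTC implements it: the class name `JitterBuffer`, the `depth` parameter, and the "return None so the caller conceals the gap" convention are all assumptions made for this example, and sequence-number wraparound is ignored.

```python
import heapq


class JitterBuffer:
    """Hypothetical fixed-depth jitter buffer keyed by sequence number."""

    def __init__(self, depth=3):
        self.depth = depth      # packets to collect before playout starts
        self._heap = []         # min-heap of (seq, payload)
        self._next_seq = None   # next sequence number due for playout

    def push(self, seq, payload):
        """Store a packet; return False if it arrived after its playout slot."""
        if self._next_seq is not None and seq < self._next_seq:
            return False
        heapq.heappush(self._heap, (seq, payload))
        return True

    def pop(self):
        """Call once per playout tick. Returns the payload that is due, or
        None while still filling or when the packet is missing (the caller
        would then play silence or a concealment frame)."""
        if self._next_seq is None:
            if len(self._heap) < self.depth:
                return None
            self._next_seq = self._heap[0][0]
        # Discard duplicates or stale entries that are already behind playout.
        while self._heap and self._heap[0][0] < self._next_seq:
            heapq.heappop(self._heap)
        seq = self._next_seq
        self._next_seq += 1
        if self._heap and self._heap[0][0] == seq:
            return heapq.heappop(self._heap)[1]
        return None
```

A larger `depth` absorbs more reordering at the cost of added delay, which is the same trade-off the thread argues about.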

Latency vs accuracy in voice AI

One camp: for voice AI, it’s more important not to drop audio than to be perfectly real time. LLMs can handle gaps; users might accept ~200–500 ms extra delay for correctness.
Opposing camp: in human–AI voice conversations every millisecond counts. An extra 200–500 ms on top of model and tool latencies makes interactions feel broken, increases interruptions, and kills engagement.
Several people running large voice systems report that going from ~1.2–1.5 s to ~700 ms turn latency dramatically improved usability; they are spending heavily to shave off another ~100–200 ms.
Some call the “latency vs reliability” framing a false dichotomy: TCP/WebSockets can stream audio immediately while still retransmitting losses; the main challenge then shifts to jitter buffering, which is needed only on the side that plays audio to a human.
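The figures quoted in this section can be combined in a small back-of-the-envelope calculation. This is only an illustration: the numbers come from different commenters, and adding them together is an assumption of this sketch, not a measurement reported in the thread.

```python
def with_extra_delay(turn_ms, extra_ms_range):
    """Return the (low, high) turn latency after adding an extra delay range."""
    low, high = extra_ms_range
    return turn_ms + low, turn_ms + high


improved_turn_ms = 700       # the improved turn latency reported above
extra_ms = (200, 500)        # the extra delay one camp would accept
before_ms = (1200, 1500)     # the reported turn latency before optimization

low, high = with_extra_delay(improved_turn_ms, extra_ms)
print(low, high)             # 900 1200

# The worst case reaches the bottom of the pre-optimization range.
reaches_old_range = high >= before_ms[0]
print(reaches_old_range)     # True
```

Under these assumptions, the upper end of the accepted extra delay would return a 700 ms system to roughly where the reported systems started, which may explain why the two camps disagree so sharply.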

Alternatives: WebTransport, WebSockets, QUIC, MoQ

WebTransport + QUIC, RTP-over-QUIC, HTTP/2, and Media over QUIC (MoQ) are discussed as ways to get WebRTC-like behavior with more control and without the P2P baggage.
Downsides: WebTransport server setup is non‑trivial and support is still maturing; MoQ is promising but niche; rolling your own over WebSockets forces you to re‑implement jitter buffering, congestion control, acoustic echo cancellation (AEC), etc.
Some see a future in QUIC-based media stacks; others doubt MoQ’s alignment with end‑to‑end problems like echo cancellation and device DSP.
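As a rough idea of where "rolling your own" begins, the sketch below frames audio chunks for a reliable byte stream such as a TCP connection or a single QUIC stream. The header layout (sequence number, timestamp in milliseconds, payload length) and the function names are hypothetical choices for this example, not part of any standard. Over WebSocket binary messages the length field would be redundant, since message boundaries are already preserved.

```python
import struct

# Hypothetical header: uint32 seq, uint32 timestamp_ms, uint16 payload length,
# all in network byte order (10 bytes total).
HEADER = struct.Struct("!IIH")


def encode_frame(seq, timestamp_ms, payload):
    """Prefix one audio chunk with its header."""
    return HEADER.pack(seq, timestamp_ms, len(payload)) + payload


def decode_frames(buffer):
    """Parse every complete frame in `buffer`.

    Returns (frames, remainder): frames is a list of
    (seq, timestamp_ms, payload) tuples, and remainder holds the bytes of a
    trailing partial frame to be prepended to the next read.
    """
    frames = []
    offset = 0
    while len(buffer) - offset >= HEADER.size:
        seq, timestamp_ms, length = HEADER.unpack_from(buffer, offset)
        start = offset + HEADER.size
        if len(buffer) - start < length:
            break  # payload not fully received yet
        frames.append((seq, timestamp_ms, buffer[start:start + length]))
        offset = start + length
    return frames, buffer[offset:]
```

The sequence number and timestamp are what a receiver-side jitter buffer would need; congestion control, echo cancellation, and the other items listed above remain unsolved by framing alone.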

Real‑world reports and UX observations

Some practitioners report good results with WebRTC-based stacks (e.g., LiveKit, Daily, mesh/managed clouds, pipecat) for large‑scale voice agents, despite setup pain.
Others find non‑WebRTC systems (e.g., persistent HTTP/2 connections similar to smart speakers) adequate and simpler for server‑centric voice AI.
Several note that users strongly feel latency in voice; some argue users implicitly trade accuracy for responsiveness, while others insist correctness must remain primary, especially for high‑stakes scenarios.
There is side discussion about IPv6 potentially simplifying routing, and about browser APIs (including WebRTC) having many undocumented edge cases and weak timestamp semantics.
