We replaced H.264 streaming with JPEG screenshots (and it worked better)
Use Case and Approach
- System streams what is essentially a remote coding session: an AI agent editing code in a sandbox, viewed in the browser.
- Original design used low-latency H.264 over WebRTC/WebSockets; replacement is periodic JPEG screenshots fetched over HTTPS.
“Why Not Just Send Text?”
- Multiple commenters question why pixels are streamed at all:
  - For terminal-like output or code, sending text diffs or higher-level editor state would be far more efficient.
- Others note the agent may use full GUIs, browsers, or arbitrary apps, making pure text insufficient.
- Some argue the entire “watch the agent type in real time” model is misguided; review diffs asynchronously instead.
JPEG / MJPEG vs H.264
- Several people point out this is effectively reinventing MJPEG (or intra-only H.264), a decades‑old technique.
- Practitioners report similar past successes with JPEG/MJPEG for drones, remote desktops, browsers, and security cameras: simple, robust, low-latency.
- Many criticize the H.264 setup:
  - 40 Mbps for 1080p text is described as absurd; 1–2 Mbps with proper settings is considered more than enough.
  - Tuning of bitrate, GOP, VBR/CBR, keyframe intervals, and frame rate was apparently never seriously attempted.
  - Using only keyframes is seen as a misuse of video codecs, which are efficient precisely because of inter-frame prediction.
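To put the bitrate complaint in concrete terms, here is a rough per-frame byte budget at the two bitrates mentioned above. The 30 fps frame rate is an assumption for illustration; the thread does not state the actual frame rate, and `bytes_per_frame` is a hypothetical helper.

```python
def bytes_per_frame(bitrate_mbps: float, fps: float) -> float:
    """Average bytes available per frame at a given bitrate and frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    bits_per_second = bitrate_mbps * 1_000_000
    return bits_per_second / 8 / fps


if __name__ == "__main__":
    ASSUMED_FPS = 30  # assumption, not a figure from the discussion
    for mbps in (40, 2):
        kb = bytes_per_frame(mbps, ASSUMED_FPS) / 1000
        print(f"{mbps} Mbps at {ASSUMED_FPS} fps: about {kb:.1f} kB per frame")
```

Under that assumed frame rate, 40 Mbps leaves roughly 167 kB per frame on average, while 2 Mbps leaves roughly 8 kB, a 20x difference in what the network has to carry.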
Congestion Control and Why JPEG “Works”
- A frequently highlighted technical insight is that the JPEG polling loop acts as crude but effective congestion control:
  - Client requests next frame only after the previous is fully received, so frames don’t pile up in buffers.
  - With H.264 over a single TCP stream, lack of explicit backpressure handling led to massive buffering and 30–45s latency.
- Commenters note this behavior is not inherent to JPEG; it comes from the pull model and from never queuing frames.
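The difference between the two models can be sketched with a toy simulation. This is not the original system's code: the function names and all timing values are hypothetical, and the link is modeled as a simple FIFO that takes a fixed time to deliver each frame.

```python
def push_latencies_ms(frame_interval_ms: int, transfer_ms: int, duration_ms: int) -> list[int]:
    """Push model: the sender emits a frame every frame_interval_ms into a
    FIFO link that needs transfer_ms per frame. Returns how old each frame
    is (in ms) at the moment it is fully delivered."""
    latencies = []
    link_free_at = 0
    for captured_at in range(0, duration_ms, frame_interval_ms):
        start = max(captured_at, link_free_at)  # wait behind queued frames
        link_free_at = start + transfer_ms
        latencies.append(link_free_at - captured_at)
    return latencies


def pull_latencies_ms(transfer_ms: int, duration_ms: int) -> list[int]:
    """Pull model: the client requests a fresh frame only after the previous
    one has fully arrived, so nothing ever queues."""
    latencies = []
    requested_at = 0
    while requested_at < duration_ms:
        delivered_at = requested_at + transfer_ms  # frame captured on request
        latencies.append(delivered_at - requested_at)
        requested_at = delivered_at
    return latencies


if __name__ == "__main__":
    push = push_latencies_ms(frame_interval_ms=100, transfer_ms=150, duration_ms=10_000)
    pull = pull_latencies_ms(transfer_ms=150, duration_ms=10_000)
    print("push: first", push[0], "ms, last", push[-1], "ms")
    print("pull: worst", max(pull), "ms over", len(pull), "frames")
```

With these made-up numbers the link is slightly too slow for the push rate, so each pushed frame arrives 50 ms later than the one before it and latency grows without bound. The pull loop delivers fewer frames, but every one of them is at most one transfer time old.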
Existing Protocols and Alternatives
- Many suggest using mature solutions instead of rolling custom stacks:
  - VNC/RFB with tiling, diffs, and CopyRect; xrdp + x264; HLS/DASH/LL‑HLS; WebRTC with TURN over 443; SSE or streaming HTTP fallbacks.
- Some propose JPEG/WebP/WebM with WebCodecs or HLS-style chunking rather than per-frame polling.
- Others note PNG is too slow to encode/decode for this use, despite better text fidelity.
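The tiling-and-diff idea mentioned for VNC/RFB can be illustrated with a minimal dirty-tile detector. This is a sketch of the general technique only, not the RFB wire format or any real VNC implementation; `dirty_tiles`, the 8-bit grayscale frame layout, and the 16-pixel tile size are all assumptions made for the example.

```python
def dirty_tiles(prev: bytes, curr: bytes, width: int, height: int, tile: int = 16):
    """Compare two frames (one byte per pixel, row-major) and return the
    tiles that changed as (x, y, w, h) rectangles."""
    if len(prev) != width * height or len(curr) != width * height:
        raise ValueError("frame size does not match width * height")
    changed = []
    for ty in range(0, height, tile):
        th = min(tile, height - ty)  # clip tiles at the bottom edge
        for tx in range(0, width, tile):
            tw = min(tile, width - tx)  # clip tiles at the right edge
            for row in range(ty, ty + th):
                start = row * width + tx
                if prev[start:start + tw] != curr[start:start + tw]:
                    changed.append((tx, ty, tw, th))
                    break  # one differing row is enough to mark the tile
    return changed


if __name__ == "__main__":
    w, h = 40, 20
    before = bytes(w * h)
    after = bytearray(before)
    after[17 * w + 35] = 255  # change a single pixel at x=35, y=17
    print(dirty_tiles(before, bytes(after), w, h))
```

Only the changed rectangles would then need to be encoded and sent, which is why this approach suits mostly static content such as a code editor.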
Enterprise Networks and Corporate IT
- Strong agreement that enterprise constraints (HTTPS/443 only, TLS MITM, broken WebSockets/SSE, intrusive DLP) heavily shape design.
- Some argue WebSockets and WebRTC-over-TURN on 443 now work in most corporate environments; others report ongoing breakage.
Perception of Engineering and LLM Use
- Several readers feel the post reflects shallow understanding of video engineering and overreliance on LLM-generated code and prose.
- Others praise the pragmatic outcome: a “dumb” but working solution that favors simplicity, even if technically suboptimal.