2025-12-23

We replaced H.264 streaming with JPEG screenshots (and it worked better)

Use Case and Approach

The system streams what is essentially a remote coding session: an AI agent editing code in a sandbox, viewed in the browser.
The original design used low-latency H.264 over WebRTC/WebSockets; the replacement is periodic JPEG screenshots fetched over HTTPS.

“Why Not Just Send Text?”

Multiple commenters question why pixels are streamed at all:
- For terminal-like output or code, sending text diffs or higher-level editor state would be far more efficient.
- Others note the agent may use full GUIs, browsers, or arbitrary apps, making pure text insufficient.
- Some argue the entire “watch the agent type in real time” model is misguided; review diffs asynchronously instead.
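
To make the "send text instead of pixels" argument concrete, here is a minimal sketch that uses Python's standard difflib to turn an edit into a unified diff. The function name text_update, the file name main.py, and the sample code are hypothetical and chosen only for illustration; nothing here reflects how the system in the post works.

```python
import difflib


def text_update(old: str, new: str, name: str = "main.py") -> str:
    """Return a unified diff describing how `old` became `new`."""
    diff = difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        fromfile=f"a/{name}",
        tofile=f"b/{name}",
    )
    return "".join(diff)


# Hypothetical one-character fix made by the agent.
old = "def add(a, b):\n    return a - b\n"
new = "def add(a, b):\n    return a + b\n"

patch = text_update(old, new)
print(patch)
print(len(patch.encode()), "bytes on the wire for this edit")
```

A diff like this only covers editor or terminal content, which is exactly the limitation raised by commenters who expect the agent to drive full GUIs or browsers.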

JPEG / MJPEG vs H.264

Several people point out this is effectively reinventing MJPEG (or intra-only H.264), a decades‑old technique.
Practitioners report similar past successes with JPEG/MJPEG for drones, remote desktops, browsers, and security cameras: simple, robust, low-latency.
Many criticize the H.264 setup:
- 40 Mbps for 1080p text is described as absurd; 1–2 Mbps with proper settings is considered more than enough.
- Commenters complain that tuning bitrate, GOP, VBR/CBR, keyframe intervals, and frame rate was apparently never seriously attempted.
- Using only keyframes is seen as a misuse of video codecs that are efficient precisely because of inter-frame prediction.
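
For a sense of scale, the bitrates quoted above can be converted into data volume. This is plain unit arithmetic (8 bits per byte, 60 seconds per minute); the helper name is hypothetical.

```python
def megabytes_per_minute(mbps: float) -> float:
    """Convert a bitrate in megabits per second to megabytes per minute."""
    return mbps / 8 * 60


# Bitrates mentioned in the discussion: the criticized 40 Mbps
# and the suggested 1-2 Mbps range.
for mbps in (40, 2, 1):
    print(f"{mbps:>2} Mbps -> {megabytes_per_minute(mbps):6.1f} MB/min")
```

At 40 Mbps the stream moves 300 MB per minute, against 7.5 to 15 MB per minute at the suggested settings.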

Congestion Control and Why JPEG “Works”

A frequently highlighted technical insight is that the JPEG polling loop acts as crude but effective congestion control:
- The client requests the next frame only after the previous one has been fully received, so frames don’t pile up in buffers.
- With H.264 over a single TCP stream, lack of explicit backpressure handling led to massive buffering and 30–45s latency.
Commenters note this behavior is not inherent to JPEG; it’s a property of the pull model and not queuing frames.
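
The difference between the two models can be sketched with a toy simulation. It assumes a deliberately simplified world: time advances in ticks, the screen changes every tick, and the link needs transfer_ticks ticks to deliver one frame. The function names and the numbers are hypothetical; this illustrates queuing behavior and is not a model of the actual system.

```python
from collections import deque


def push_latencies(frames: int, transfer_ticks: int) -> list[int]:
    """Sender enqueues one frame per tick; the link drains them in FIFO order."""
    queue = deque(range(frames))  # each entry is the tick the frame was captured
    latencies = []
    link_free_at = 0
    while queue:
        captured = queue.popleft()
        start = max(captured, link_free_at)  # wait for the link if it is busy
        link_free_at = start + transfer_ticks
        latencies.append(link_free_at - captured)
    return latencies


def pull_latencies(frames: int, transfer_ticks: int) -> list[int]:
    """Client asks for the newest frame only after the previous one has arrived."""
    latencies = []
    now = 0
    while now < frames:
        captured = now         # server snapshots the screen at request time
        now += transfer_ticks  # transfer completes; only then is the next request sent
        latencies.append(now - captured)
    return latencies


print("push:", push_latencies(10, 3))
print("pull:", pull_latencies(10, 3))
```

In the push case every frame waits behind all earlier ones, so latency grows with each frame. In the pull case intermediate frames are never sent, so latency stays at one transfer time at the cost of a lower frame rate.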

Existing Protocols and Alternatives

Many suggest using mature solutions instead of rolling custom stacks:
- VNC/RFB with tiling, diffs, and CopyRect; xrdp + x264; HLS/DASH/LL‑HLS; WebRTC with TURN over 443; SSE or streaming HTTP fallbacks.
- Some propose JPEG/WebP/WebM with WebCodecs or HLS-style chunking rather than per-frame polling.
- Others note PNG is too slow to encode/decode for this use, despite better text fidelity.
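
The tiling-and-diff idea mentioned for VNC/RFB can be illustrated with a small sketch: compare two frames tile by tile and resend only the tiles that changed. Frames here are plain nested lists of pixel values, and the function name and tile size are hypothetical; real RFB encodings are considerably more involved.

```python
def dirty_tiles(prev, curr, tile=16):
    """Return (x, y, w, h) for each tile whose pixels differ between frames."""
    height, width = len(curr), len(curr[0])
    tiles = []
    for y in range(0, height, tile):
        for x in range(0, width, tile):
            h = min(tile, height - y)  # clip tiles at the bottom edge
            w = min(tile, width - x)   # clip tiles at the right edge
            changed = any(
                prev[row][x:x + w] != curr[row][x:x + w]
                for row in range(y, y + h)
            )
            if changed:
                tiles.append((x, y, w, h))
    return tiles


# A 64x32 blank frame in which a single pixel changes.
prev = [[0] * 64 for _ in range(32)]
curr = [row[:] for row in prev]
curr[20][40] = 255

print(dirty_tiles(prev, curr))  # [(32, 16, 16, 16)]
```

For a mostly static screen such as a code editor, only the few tiles around the cursor would need to be re-encoded and sent.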

Enterprise Networks and Corporate IT

Strong agreement that enterprise constraints (HTTPS/443 only, TLS MITM, broken WebSockets/SSE, intrusive DLP) heavily shape design.
Some argue WebSockets and WebRTC-over-TURN on 443 now work in most corporate environments; others report ongoing breakage.

Perception of Engineering and LLM Use

Several readers feel the post reflects shallow understanding of video engineering and overreliance on LLM-generated code and prose.
Others praise the pragmatic outcome: a “dumb” but working solution that favors simplicity, even if technically suboptimal.
