We replaced H.264 streaming with JPEG screenshots (and it worked better)

Use Case and Approach

  • System streams what is essentially a remote coding session: an AI agent editing code in a sandbox, viewed in the browser.
  • Original design used low-latency H.264 over WebRTC/WebSockets; replacement is periodic JPEG screenshots fetched over HTTPS.

“Why Not Just Send Text?”

  • Multiple commenters question why pixels are streamed at all, with some pushback:
    • For terminal-like output or code, sending text diffs or higher-level editor state would be far more efficient.
    • Others note the agent may use full GUIs, browsers, or arbitrary apps, making pure text insufficient.
    • Some argue the entire “watch the agent type in real time” model is misguided; review diffs asynchronously instead.
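
To make the "send text diffs" suggestion concrete, here is a minimal sketch using Python's standard difflib. The function name text_update and the sample strings are hypothetical, and nothing here reflects how the original system represents editor state; it only shows that a one-line edit can travel as a small patch.

```python
import difflib


def text_update(old: str, new: str) -> str:
    """Return a unified diff describing how to turn `old` into `new`."""
    diff = difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        fromfile="before",
        tofile="after",
    )
    return "".join(diff)


# Hypothetical editor contents before and after the agent fixes one line.
old = "def add(a, b):\n    return a - b\n"
new = "def add(a, b):\n    return a + b\n"

patch = text_update(old, new)
print(patch)
print(len(patch.encode()), "bytes on the wire (before any compression)")
```

This only covers the text-only case; as the counterpoint above notes, it does not help once the agent drives a browser or another GUI application.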

JPEG / MJPEG vs H.264

  • Several people point out this is effectively reinventing MJPEG (or intra-only H.264), a decades‑old technique.
  • Practitioners report similar past successes with JPEG/MJPEG for drones, remote desktops, browsers, and security cameras: simple, robust, low-latency.
  • Many criticize the H.264 setup:
    • 40 Mbps for 1080p text is described as absurd; 1–2 Mbps with proper settings is considered more than enough.
    • Commenters complain that bitrate, GOP, VBR/CBR, keyframe interval, and frame rate were apparently never seriously tuned.
    • Using only keyframes is seen as a misuse of video codecs that are efficient precisely because of inter-frame prediction.
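
The bitrate complaint is easier to judge as a per-frame byte budget. The sketch below is plain arithmetic; the 30 fps frame rate is an assumption, since the discussion does not state one, and bytes_per_frame is a hypothetical helper.

```python
def bytes_per_frame(bitrate_mbps: float, fps: float) -> float:
    """Average bytes available per frame at a given bitrate and frame rate."""
    return bitrate_mbps * 1_000_000 / 8 / fps


ASSUMED_FPS = 30  # assumption: the thread does not give the actual frame rate

for mbps in (40, 2, 1):
    kb = bytes_per_frame(mbps, ASSUMED_FPS) / 1000
    print(f"{mbps:>2} Mbps at {ASSUMED_FPS} fps -> {kb:.1f} kB per frame on average")
```

Under that assumption, 40 Mbps leaves roughly 167 kB for every frame, while 1–2 Mbps leaves only a few kB, which is workable only because inter-frame prediction makes most frames of a mostly static screen very small.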

Congestion Control and Why JPEG “Works”

  • A frequently highlighted technical insight is that the JPEG polling loop acts as crude but effective congestion control:
    • Client requests next frame only after the previous is fully received, so frames don’t pile up in buffers.
    • With H.264 over a single TCP stream, the lack of explicit backpressure handling led to massive buffering and 30–45 s of latency.
  • Commenters note this behavior is not inherent to JPEG; it comes from the pull model and from never queuing frames.
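
The difference between the two models can be illustrated with a toy simulation. Everything below is an assumption made for illustration (frame size, link speed, RTT, and the function names); it does not model TCP or the original system, only the effect of queuing every frame versus requesting one frame at a time.

```python
def push_latencies(frame_bytes, fps, link_bytes_per_s, seconds):
    """Sender pushes every captured frame into one FIFO stream; nothing is dropped."""
    latencies = []
    link_free_at = 0.0
    for i in range(int(seconds * fps)):
        captured = i / fps
        start = max(captured, link_free_at)  # wait behind frames already queued
        link_free_at = start + frame_bytes / link_bytes_per_s
        latencies.append(link_free_at - captured)
    return latencies


def pull_latencies(frame_bytes, link_bytes_per_s, seconds, rtt=0.05):
    """Client requests the newest frame only after the previous one has arrived."""
    latencies = []
    now = 0.0
    while now < seconds:
        captured = now + rtt / 2  # request reaches the server, which grabs a fresh frame
        delivered = captured + frame_bytes / link_bytes_per_s + rtt / 2
        latencies.append(delivered - captured)
        now = delivered  # the next request goes out only now
    return latencies


# Hypothetical numbers: 100 kB frames, 30 fps capture, 1 MB/s of usable bandwidth.
push = push_latencies(100_000, 30, 1_000_000, seconds=60)
pull = pull_latencies(100_000, 1_000_000, seconds=60)

print(f"push: latency of last frame = {push[-1]:.1f} s")
print(f"pull: worst latency = {max(pull):.3f} s over {len(pull)} frames")
```

With these made-up numbers the pushed stream falls further behind with every frame, while the pull loop delivers fewer frames but each one is at most one transfer time plus the RTT old. Swapping JPEG for any other per-frame format would not change that result, which is the commenters' point.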

Existing Protocols and Alternatives

  • Many suggest using mature solutions instead of rolling custom stacks:
    • VNC/RFB with tiling, diffs, and CopyRect; xrdp + x264; HLS/DASH/LL‑HLS; WebRTC with TURN over 443; SSE or streaming HTTP fallbacks.
    • Some propose JPEG/WebP/WebM with WebCodecs or HLS-style chunking rather than per-frame polling.
    • Others note PNG is too slow to encode/decode for this use, despite better text fidelity.
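
The "tiling and diffs" idea behind VNC-style protocols can be sketched in a few lines. This is not the RFB wire format and not any particular server's implementation; dirty_tiles, the 16-pixel tile size, and the list-of-rows frame representation are all hypothetical choices for illustration.

```python
def dirty_tiles(prev, curr, tile=16):
    """Return (x, y, w, h) for each tile whose pixels differ between two frames.

    Frames are lists of rows, and each row is a list of pixel values.
    Both frames are assumed to have the same dimensions.
    """
    height, width = len(curr), len(curr[0])
    changed = []
    for y in range(0, height, tile):
        for x in range(0, width, tile):
            h = min(tile, height - y)
            w = min(tile, width - x)
            if any(prev[r][x:x + w] != curr[r][x:x + w] for r in range(y, y + h)):
                changed.append((x, y, w, h))
    return changed


# A blank 64x48 frame, then the same frame with one pixel changed at x=20, y=5.
before = [[0] * 64 for _ in range(48)]
after = [row[:] for row in before]
after[5][20] = 255

print(dirty_tiles(before, after))
```

Only the changed tiles would then need to be encoded and sent, which is why a blinking cursor or a single typed character costs far less than a full-frame JPEG.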

Enterprise Networks and Corporate IT

  • Strong agreement that enterprise constraints (HTTPS/443 only, TLS MITM, broken WebSockets/SSE, intrusive DLP) heavily shape design.
  • Some argue WebSockets and WebRTC-over-TURN on 443 now work in most corporate environments; others report ongoing breakage.

Perception of Engineering and LLM Use

  • Several readers feel the post reflects shallow understanding of video engineering and overreliance on LLM-generated code and prose.
  • Others praise the pragmatic outcome: a “dumb” but working solution that favors simplicity, even if technically suboptimal.