WebSockets cost us $1M on our AWS bill

Overall theme

  • The thread largely agrees the issue was CPU cost from passing raw video between local processes over loopback WebSockets, not AWS data‑transfer charges.
  • Many see it as a classic “PoC in production” that only becomes a problem at scale; others argue experienced systems engineers would have avoided it.

Architecture & root cause

  • Recall used headless Chromium bots to join third‑party video calls, capture rendered frames, then send raw 1080p video via WebSockets to another process for encoding and analysis.
  • Profiling showed heavy overhead from WebSocket fragmentation, masking, multiple memcpy operations, and general message framing on huge frames.
  • They replaced this with a shared‑memory ring buffer, significantly reducing CPU usage and thus AWS compute cost.
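
To make the replacement concrete, here is a minimal sketch of a single‑producer, single‑consumer ring of fixed‑size frame slots. It is not Recall's implementation: the class name `FrameRing`, the 16‑byte header layout, and the slot scheme are all assumptions made for illustration. The demo backs the ring with a plain `bytearray`; between two processes the same class could presumably be handed the `buf` of a `multiprocessing.shared_memory.SharedMemory` segment instead. Python offers no atomics or memory‑ordering control, so a real cross‑process version would need those guarantees from a lower‑level language.

```python
import struct


class FrameRing:
    """Hypothetical SPSC ring of fixed-size frame slots over a writable buffer.

    Assumed layout: a 16-byte header holding two little-endian uint64
    counters (frames written, frames read), then slots * frame_size bytes.
    """

    HEADER = struct.Struct("<QQ")

    def __init__(self, buf, slots, frame_size):
        needed = self.HEADER.size + slots * frame_size
        if len(buf) < needed:
            raise ValueError(f"buffer too small: need {needed} bytes")
        self.buf = memoryview(buf)
        self.slots = slots
        self.frame_size = frame_size

    def _offset(self, counter):
        # Map a monotonically increasing counter onto a slot position.
        return self.HEADER.size + (counter % self.slots) * self.frame_size

    def push(self, frame):
        """Copy one frame into the next free slot; return False if full."""
        if len(frame) != self.frame_size:
            raise ValueError("frame has the wrong size")
        written, read = self.HEADER.unpack_from(self.buf, 0)
        if written - read == self.slots:
            return False
        off = self._offset(written)
        self.buf[off:off + self.frame_size] = frame
        # Publish the frame only after its payload is in place.
        struct.pack_into("<Q", self.buf, 0, written + 1)
        return True

    def pop(self):
        """Return the oldest unread frame as bytes, or None if empty."""
        written, read = self.HEADER.unpack_from(self.buf, 0)
        if written == read:
            return None
        off = self._offset(read)
        frame = bytes(self.buf[off:off + self.frame_size])  # one copy out
        struct.pack_into("<Q", self.buf, 8, read + 1)
        return frame


# Tiny demo with 4-byte "frames" standing in for video frames.
demo = FrameRing(bytearray(FrameRing.HEADER.size + 2 * 4), slots=2, frame_size=4)
demo.push(b"aaaa")
print(demo.pop())  # b'aaaa'
```

Compared with a WebSocket, there is no framing, masking, or fragmentation step here: the producer performs one copy into a slot and the consumer reads it in place or copies it out once. A consumer that wanted to avoid even that copy could return a `memoryview` slice instead of `bytes`, at the price of having to release the slot explicitly.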

Debate on design choices

  • Many criticize decoding and compositing in Chromium, then shipping uncompressed frames instead of:
    • Keeping streams compressed longer.
    • Tapping into underlying codecs, WebRTC, or GPU pipelines.
    • Using existing IPC/shm mechanisms (/dev/shm, Mojo, iceoryx2, Redis, etc.).
  • Defenders point out constraints:
    • They’re scraping many meeting platforms that don’t expose compressed streams publicly.
    • Reverse‑engineering or negotiating private APIs for each provider would be brittle and slow.
    • A naïve but robust solution let them validate the business first.

Technical nitpicks & disagreements

  • Some say the post's discussion of MTU, fragmentation, and bandwidth shows shallow systems knowledge; others argue that since the team ended up on shared memory, further TCP tuning is moot.
  • Disagreement over how critical zero‑copy really is at 1080p given typical server memory bandwidth.
  • Some question the use of CPU rather than GPU for rendering and encoding at million‑dollar scale.
  • There’s debate about lock‑free shared‑memory designs, atomics, and memory ordering, but general agreement that shared memory is the right class of solution.
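
The zero‑copy disagreement is easier to weigh with rough numbers. The sketch below estimates the size of a raw 1080p frame and the per‑stream data rate. This summary does not state the pixel format or frame rate, so both are assumptions here: 30 fps, with I420 (1.5 bytes per pixel) and RGBA (4 bytes per pixel) shown as two plausible cases.

```python
WIDTH, HEIGHT = 1920, 1080
FPS = 30  # assumed frame rate; not stated in the thread summary

# Assumed pixel formats and their average bytes per pixel.
BYTES_PER_PIXEL = {"I420": 1.5, "RGBA": 4.0}


def frame_bytes(fmt):
    """Size in bytes of one uncompressed frame in the given format."""
    return int(WIDTH * HEIGHT * BYTES_PER_PIXEL[fmt])


def stream_rate(fmt, fps=FPS):
    """Bytes per second for one uncompressed stream."""
    return frame_bytes(fmt) * fps


for fmt in BYTES_PER_PIXEL:
    print(f"{fmt}: {frame_bytes(fmt):,} B/frame, {stream_rate(fmt) / 1e6:.1f} MB/s")
# I420: 3,110,400 B/frame, 93.3 MB/s
# RGBA: 8,294,400 B/frame, 248.8 MB/s
```

Under these assumptions a single stream is small next to server memory bandwidth, which is the point made by those who doubt zero‑copy matters much. The counterpoint is that every extra copy both reads and writes the full frame, and the total scales with the number of copies per frame and the number of concurrent bots per machine.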

AWS, cost, and framing

  • Several commenters find the title misleading, expecting AWS egress or API Gateway WebSocket charges; the real issue is CPU hours on EC2.
  • Long side‑discussion compares AWS prices vs dedicated servers/colos (e.g., Hetzner), with claims of large cost deltas but no consensus on trade‑offs.
  • Some praise the transparency and postmortem; others think it highlights a lack of low‑level expertise.