WebSockets cost us $1M on our AWS bill
Overall theme
- Thread largely agrees the issue was CPU cost from inefficient IPC of raw video over WebSockets on loopback, not AWS data transfer.
- Many see it as a classic “PoC in production” that only becomes a problem at scale; others argue experienced systems engineers would have avoided it.
Architecture & root cause
- Recall used headless Chromium bots to join third‑party video calls, capture rendered frames, then send raw 1080p video via WebSockets to another process for encoding and analysis.
- Profiling showed heavy overhead from WebSocket fragmentation, masking, multiple memcpy operations, and general message framing on huge frames.
- They replaced this with a shared‑memory ring buffer, significantly reducing CPU usage and thus AWS compute cost.
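To make the framing and masking overhead concrete, here is a minimal Python sketch of RFC 6455 framing applied to a frame-sized payload. It is illustrative only and is not Recall's code. The I420 pixel format (12 bits per pixel) is an assumption the thread does not confirm, and `xor_mask` and `encode_frame` are hypothetical helper names.

```python
import struct
from typing import Optional

WIDTH, HEIGHT = 1920, 1080
# Assumption: I420 (YUV 4:2:0, 12 bits per pixel); the thread does not name the format.
FRAME_BYTES = WIDTH * HEIGHT * 3 // 2  # 3,110,400 bytes per raw frame


def xor_mask(payload: bytes, key: bytes) -> bytes:
    """XOR every payload byte with the repeating 4-byte key (RFC 6455, section 5.3)."""
    n = len(payload)
    repeated = (key * (n // 4 + 1))[:n]  # the key stream itself is a full-size copy
    masked = int.from_bytes(payload, "big") ^ int.from_bytes(repeated, "big")
    return masked.to_bytes(n, "big")


def encode_frame(payload: bytes, key: Optional[bytes] = None, opcode: int = 0x2) -> bytes:
    """Build one unfragmented frame (RFC 6455, section 5.2); opcode 0x2 is binary."""
    header = bytearray([0x80 | opcode])  # FIN=1, RSV bits clear
    mask_bit = 0x80 if key is not None else 0x00
    n = len(payload)
    if n < 126:
        header.append(mask_bit | n)
    elif n < (1 << 16):
        header.append(mask_bit | 126)
        header += struct.pack("!H", n)  # 16-bit extended length
    else:
        header.append(mask_bit | 127)
        header += struct.pack("!Q", n)  # 64-bit extended length
    if key is None:
        return bytes(header) + payload  # one more copy of the payload
    return bytes(header) + key + xor_mask(payload, key)  # mask pass plus a copy


if __name__ == "__main__":
    frame = encode_frame(bytes(FRAME_BYTES), key=b"\x12\x34\x56\x78")
    # Prints the payload size and the per-message header size in bytes.
    print(FRAME_BYTES, len(frame) - FRAME_BYTES)
```

Under these assumptions the header is a rounding error next to a roughly 3 MB payload. The cost is that masking reads and rewrites every byte, and each concatenation copies the frame again, before the kernel copies it across the loopback socket.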
Debate on design choices
- Many criticize decoding and compositing in Chromium, then shipping uncompressed frames instead of:
  - Keeping streams compressed longer.
  - Tapping into underlying codecs, WebRTC, or GPU pipelines.
  - Using existing IPC/shm mechanisms (/dev/shm, Mojo, iceoryx2, Redis, etc.).
- Defenders point out constraints:
  - They’re scraping many meeting platforms that don’t expose compressed streams publicly.
  - Reverse-engineering or negotiating private APIs for each provider would be brittle and slow.
  - A naïve but robust solution let them validate the business first.
Technical nitpicks & disagreements
- Some say the post's discussion of MTU, fragmentation, and bandwidth shows shallow systems knowledge; others argue that because the fix was shared memory, further TCP tuning is moot.
- Disagreement over how critical zero‑copy really is at 1080p given typical server memory bandwidth.
- Some question why rendering and encoding ran on CPU rather than GPU, given the million-dollar scale.
- There’s debate about lock‑free shared‑memory designs, atomics, and memory ordering, but general agreement that shared memory is the right class of solution.
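As a rough illustration of the shared-memory class of solution discussed above, here is a minimal single-producer/single-consumer ring of fixed-size frame slots built on Python's `multiprocessing.shared_memory`. This is a sketch under stated assumptions, not the design from the post: `FrameRing`, its slot layout, and the two-counter header are all hypothetical. Python offers no atomics, so the counter updates here are plain writes. A real cross-process version (for example with C++ `std::atomic` or Rust atomics) would publish the write counter with release ordering and read it with acquire ordering, so the consumer never sees the counter advance before the frame bytes are visible.

```python
import struct
from multiprocessing import shared_memory

HEADER = struct.Struct("<QQ")  # write_count, read_count (monotonic frame counters)


class FrameRing:
    """Hypothetical fixed-slot SPSC ring over one shared-memory block."""

    def __init__(self, name=None, slots=4, slot_bytes=1024, create=False):
        self.slots, self.slot_bytes = slots, slot_bytes
        size = HEADER.size + slots * slot_bytes
        self.shm = shared_memory.SharedMemory(name=name, create=create, size=size)
        if create:
            HEADER.pack_into(self.shm.buf, 0, 0, 0)

    def _offset(self, count):
        return HEADER.size + (count % self.slots) * self.slot_bytes

    def push(self, frame: bytes) -> bool:
        if len(frame) != self.slot_bytes:
            raise ValueError("frame must fill exactly one slot")
        w, r = HEADER.unpack_from(self.shm.buf, 0)
        if w - r == self.slots:
            return False  # ring full: the consumer has fallen behind
        start = self._offset(w)
        self.shm.buf[start:start + self.slot_bytes] = frame  # write the data first
        # Then publish it. NOT atomic in Python; needs a release store in real code.
        struct.pack_into("<Q", self.shm.buf, 0, w + 1)
        return True

    def pop(self):
        w, r = HEADER.unpack_from(self.shm.buf, 0)
        if w == r:
            return None  # ring empty
        start = self._offset(r)
        frame = bytes(self.shm.buf[start:start + self.slot_bytes])
        struct.pack_into("<Q", self.shm.buf, 8, r + 1)  # hand the slot back
        return frame

    def close(self, unlink=False):
        self.shm.close()
        if unlink:
            self.shm.unlink()


if __name__ == "__main__":
    producer = FrameRing(create=True, slots=2, slot_bytes=8)
    # A second handle attached by name stands in for the consumer process.
    consumer = FrameRing(name=producer.shm.name, slots=2, slot_bytes=8)
    try:
        producer.push(b"frame-01")
        print(consumer.pop())
    finally:
        consumer.close()
        producer.close(unlink=True)
```

The point of the design is that a frame is written once into the shared block and read in place by the other side, with no framing, masking, or socket copies. Sizing each slot to one raw frame keeps the index math trivial, at the cost of not supporting variable-length messages.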
AWS, cost, and framing
- Several commenters find the title misleading, expecting AWS egress or API Gateway WebSocket charges; the real issue is CPU hours on EC2.
- A long side discussion compares AWS prices with dedicated servers and colocation (e.g., Hetzner), with claims of large cost deltas but no consensus on trade-offs.
- Some praise the transparency and postmortem; others think it highlights a lack of low‑level expertise.