We survived 10k requests/second: Switching to signed asset URLs in an emergency
Incident & root cause
- A public Google Cloud Storage bucket was hit with ~10k requests/sec for ~7 hours, causing a large egress bill.
- Access to individual objects was public; attackers obtained object URLs via the public API rather than bucket listing.
- The fix was to switch to signed URLs and add rate limiting through the application stack.
Signed URLs: purpose & implementation
- Several commenters clarify that GCS/S3-style signed URLs are generated locally via HMAC using stored credentials; no remote API call is required.
- The observed ~250 ms latency likely came from using a higher-level API (e.g., per-file signing that triggers HTTP calls) rather than direct crypto.
- Advice: use bucket-level signing APIs instead of per-object ones to avoid extra round-trips.
- Some argue that unguessable object names plus no list access can also mitigate scraping without requiring daily re-auth via signed URLs.
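To illustrate the point that signing is local crypto rather than a network call, here is a minimal sketch of an HMAC-signed URL with an expiry. It is a simplified scheme for illustration only, not the actual GCS or S3 signing algorithm; `SECRET_KEY`, `sign_url`, `verify`, and the `expires`/`sig` query parameters are all hypothetical names.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import quote

# Hypothetical shared secret; in practice it would come from a secret store.
SECRET_KEY = b"replace-with-a-real-secret"


def _signature(path: str, expires_at: int, key: bytes) -> str:
    # Sign both the path and the expiry so that neither can be altered.
    message = f"{path}\n{expires_at}".encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def sign_url(path: str, expires_at: int, key: bytes = SECRET_KEY) -> str:
    """Build a signed URL path using local computation only (no HTTP call)."""
    sig = _signature(path, expires_at, key)
    return f"{quote(path)}?expires={expires_at}&sig={sig}"


def verify(path: str, expires_at: int, sig: str,
           now: Optional[int] = None, key: bytes = SECRET_KEY) -> bool:
    """Check the expiry and the signature, comparing in constant time."""
    now = int(time.time()) if now is None else now
    if now > expires_at:
        return False
    return hmac.compare_digest(_signature(path, expires_at, key), sig)
```

Calling `sign_url("/assets/logo.png", int(time.time()) + 3600)` would produce a link valid for one hour. Since the work is one HMAC computation, signing this way should take microseconds rather than hundreds of milliseconds.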
CDNs, WAF, and rate limiting
- Many say the “correct” pattern is: private bucket → CDN (CloudFront/Cloud CDN/Cloudflare) → WAF/rate limiting at the edge.
- This blocks direct bucket access, lets you enforce per-IP or per-session limits, and offloads bandwidth to edge caches.
- Concern: even with signed URLs, an attacker can brute-force the API that issues them unless rate limiting exists there as well.
- Edge-level checks (session-cookie HMACs, X-Accel-Redirect, WAF rate limiting) are recommended as cheaper than pushing traffic into app servers.
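The per-IP or per-session limits mentioned above are often implemented as a token bucket. The following is a minimal in-process sketch of that idea, not any particular CDN's or WAF's implementation; the class name `RateLimiter` and its parameters are hypothetical, and a real deployment would typically keep this state at the edge or in a shared store.

```python
import time
from typing import Dict, Optional, Tuple


class RateLimiter:
    """Token bucket per client key (for example an IP or a session ID)."""

    def __init__(self, rate: float, burst: float) -> None:
        self.rate = rate      # tokens refilled per second
        self.burst = burst    # maximum tokens a client can accumulate
        # client -> (tokens remaining, time of last update).
        # Note: this dict grows without bound; a real limiter would evict
        # idle clients.
        self._buckets: Dict[str, Tuple[float, float]] = {}

    def allow(self, client: str, now: Optional[float] = None) -> bool:
        """Return True if the request may proceed, False if it is throttled."""
        now = time.monotonic() if now is None else now
        tokens, last = self._buckets.get(client, (self.burst, now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self._buckets[client] = (tokens - 1.0, now)
            return True
        self._buckets[client] = (tokens, now)
        return False
```

The same limiter could guard the endpoint that issues signed URLs, which addresses the concern that signing alone does not stop a client from requesting fresh URLs at full speed.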
Cost, architecture, and alternatives
- Strong debate on cloud vs. simpler setups:
  - Critics call the current architecture overengineered for the traffic level and note that 10k rps of static files is trivial on a single modern server.
  - Others point out that many devs lack ops skills; managed cloud reduces operational burden at higher dollar cost.
- Alternatives raised: Hetzner/OVH bare metal, DigitalOcean droplets, Backblaze B2 + Fastly, Cloudflare R2 (no egress fees), home-brewed setups.
- Some emphasize opportunity cost: time spent building “cheap infra” vs building product features.
Performance & scale skepticism
- Multiple commenters state 10k rps is not inherently high; 20–50k rps is feasible on modest hardware, especially for static content.
- Others note that bandwidth, response size, and database limits (e.g., Postgres connection caps) can become bottlenecks before CPU.
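A back-of-the-envelope calculation shows why response size matters more than the raw request count. The 10k rps rate and the 7-hour duration come from the incident summary; the 100 KB average response size is purely an assumption for illustration, since the thread does not state one.

```python
def bandwidth_gbit_per_s(rps: float, response_bytes: float) -> float:
    """Sustained outbound bandwidth in gigabits per second."""
    return rps * response_bytes * 8 / 1e9


def egress_terabytes(rps: float, response_bytes: float, hours: float) -> float:
    """Total bytes sent over the period, in decimal terabytes."""
    return rps * response_bytes * hours * 3600 / 1e12


# Assumed average response size (hypothetical, not from the incident report).
ASSUMED_RESPONSE_BYTES = 100_000

gbps = bandwidth_gbit_per_s(10_000, ASSUMED_RESPONSE_BYTES)   # 8.0 Gbit/s
total_tb = egress_terabytes(10_000, ASSUMED_RESPONSE_BYTES, 7)  # 25.2 TB
```

Under that assumption the network link and the egress bill, not CPU, would be the limiting factors; with 1 KB responses the same request rate would need only about 0.08 Gbit/s.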
Security & robustness concerns
- Warnings about open redirects and URL-parsing edge cases when accepting URLs from users to sign.
- Recommendation to ensure the bucket is private, to validate buckets and paths before signing, and to avoid blindly trusting a language's built-in URL parser.
- General consensus: rate limiting and backoff should be designed in early at multiple layers, not bolted on after an incident.
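One way to follow the "validate before signing" advice is to avoid accepting full URLs at all and instead accept a bucket name plus an object name, both checked against an allowlist. The sketch below assumes that design; `ALLOWED_BUCKETS`, `ALLOWED_PREFIX`, and `validate_object` are hypothetical names, and the exact rules would depend on the application.

```python
import posixpath
from typing import Tuple

# Hypothetical allowlist: the only bucket and key prefix we agree to sign for.
ALLOWED_BUCKETS = {"public-assets"}
ALLOWED_PREFIX = "images/"


def validate_object(bucket: str, object_name: str) -> Tuple[str, str]:
    """Return (bucket, object_name) if safe to sign, else raise ValueError."""
    if bucket not in ALLOWED_BUCKETS:
        raise ValueError("bucket not allowed")
    if not object_name or "\x00" in object_name or "\\" in object_name:
        raise ValueError("malformed object name")
    # Require the name to already be in normal form. This rejects '..',
    # './', doubled slashes, and trailing slashes instead of silently
    # rewriting them.
    if posixpath.normpath(object_name) != object_name:
        raise ValueError("object name is not normalized")
    if not object_name.startswith(ALLOWED_PREFIX):
        raise ValueError("object is outside the allowed prefix")
    return bucket, object_name
```

Rejecting non-normalized input, rather than cleaning it up, keeps the validator and whatever later builds the URL from disagreeing about which object is meant.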
Misc suggestions
- Ideas include publishing periodic data snapshots to archive.org, using IP-level firewalls against abusive scrapers, and asking the cloud provider for billing relief after such incidents.