We survived 10k requests/second: Switching to signed asset URLs in an emergency

Incident & root cause

  • A public Google Cloud Storage bucket was hit with ~10k requests/sec for ~7 hours, causing a large egress bill.
  • Access to individual objects was public; attackers obtained object URLs via the public API rather than bucket listing.
  • The fix was to switch to signed URLs and add rate limiting through the application stack.
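For scale, here is a back-of-envelope count for the incident above. About 10k requests/sec for about 7 hours comes to roughly 252 million requests. The egress figure depends entirely on object size, which is not stated, so AVG_OBJECT_BYTES below is a hypothetical 200 kB chosen only for illustration.

```python
from typing import Tuple

REQUESTS_PER_SEC = 10_000
HOURS = 7
AVG_OBJECT_BYTES = 200 * 1000  # hypothetical: real object sizes are not given


def incident_totals(rps: int, hours: int, avg_bytes: int) -> Tuple[int, float]:
    """Return (total requests, total egress in decimal terabytes)."""
    total_requests = rps * hours * 3600
    total_tb = total_requests * avg_bytes / 1e12
    return total_requests, total_tb


requests, tb = incident_totals(REQUESTS_PER_SEC, HOURS, AVG_OBJECT_BYTES)
print(f"{requests:,} requests, ~{tb:.1f} TB egress")
```

Egress grows linearly with average object size, so doubling AVG_OBJECT_BYTES doubles the estimated transfer and the bill.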

Signed URLs: purpose & implementation

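  • Several commenters clarify that GCS/S3-style signed URLs are computed locally from stored credentials (HMAC-SHA256 for S3 SigV4; RSA-SHA256 with a service-account key, or HMAC-SHA256 with an HMAC key, for GCS V4). When the key is available locally, no remote API call is required.
  • The observed ~250 ms latency likely came from a higher-level API path (e.g., per-file signing calls that trigger HTTP requests) rather than from the cryptography itself.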
  • Advice: use bucket-level signing APIs instead of per-object ones to avoid extra round-trips.
  • Some argue that unguessable object names plus disabled list access can also mitigate scraping, without the daily re-signing that expiring signed URLs require.
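The point that signing is local cryptography can be illustrated with a generic HMAC scheme. This is a minimal sketch and not the GCS or S3 signed-URL format. The query parameters expires and sig, the helper names, and SECRET are all hypothetical. Once the key is loaded, signing many objects is just a loop of in-process hash computations with no network round-trips.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical key material; a real service would load this once at startup.
SECRET = b"example-secret-key"


def _signature(path: str, expires_at: int, secret: bytes) -> str:
    # HMAC-SHA256 over path and expiry: pure local computation, no network I/O.
    message = f"{path}\n{expires_at}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def sign_path(path: str, expires_at: int, secret: bytes = SECRET) -> str:
    """Append hypothetical expires/sig query parameters to an object path."""
    query = urlencode({"expires": expires_at, "sig": _signature(path, expires_at, secret)})
    return f"{path}?{query}"


def verify(url: str, now: Optional[int] = None, secret: bytes = SECRET) -> bool:
    """Accept only unexpired URLs whose signature matches the path and expiry."""
    parts = urlsplit(url)
    params = parse_qs(parts.query)
    try:
        expires_at = int(params["expires"][0])
        sig = params["sig"][0]
    except (KeyError, IndexError, ValueError):
        return False
    current = int(time.time()) if now is None else now
    if current > expires_at:
        return False
    expected = _signature(parts.path, expires_at, secret)
    # Constant-time comparison; bytes so any decoded input is accepted safely.
    return hmac.compare_digest(expected.encode(), sig.encode())


# Signing a page's worth of objects is a local loop, not a series of API calls.
expiry = int(time.time()) + 24 * 3600
urls = [sign_path(f"/covers/{n}.jpg", expiry) for n in range(3)]
```

Verification recomputes the HMAC over the path and expiry. A tampered path, a missing signature, or an expired link is therefore rejected without any lookup.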

CDNs, WAF, and rate limiting

  • Many say the “correct” pattern is: private bucket → CDN (CloudFront/Cloud CDN/Cloudflare) → WAF/rate limiting at the edge.
  • This blocks direct bucket access, lets you enforce per-IP or per-session limits, and offloads bandwidth to edge caches.
  • Concern: even with signed URLs, an attacker can simply call the API that issues them at high volume unless that endpoint is rate-limited as well.
  • Checks at the edge or reverse proxy (HMAC-signed session cookies, nginx X-Accel-Redirect so the app authorizes a request while nginx serves the file, WAF rate limiting) are recommended as cheaper than streaming file traffic through app servers.
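Per-IP rate limiting usually belongs at the CDN or WAF, but the underlying idea is a token bucket. Here is a minimal in-process sketch, assuming a single process. PerIPRateLimiter is a hypothetical class and not any particular WAF's API. The clock is injectable so the behavior can be tested deterministically.

```python
import time
from typing import Callable, Dict, Tuple


class PerIPRateLimiter:
    """Token bucket per client IP: `rate` requests/sec sustained, bursts up to `burst`."""

    def __init__(self, rate: float, burst: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.burst = burst
        self.clock = clock
        # ip -> (tokens remaining, time of last refill)
        self.buckets: Dict[str, Tuple[float, float]] = {}

    def allow(self, ip: str) -> bool:
        now = self.clock()
        tokens, last = self.buckets.get(ip, (self.burst, now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self.buckets[ip] = (tokens - 1.0, now)
            return True
        self.buckets[ip] = (tokens, now)
        return False
```

The bucket map grows with every distinct IP and lives in one process's memory. That is one reason the thread favors enforcing limits at the edge, where they are shared across servers and abusive traffic never reaches the app.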

Cost, architecture, and alternatives

  • Strong debate on cloud vs simpler setups:
    • Critics call the current architecture overengineered for the traffic level and note that 10k rps of static files is trivial on a single modern server.
    • Others point out that many devs lack ops skills; managed cloud reduces operational burden at higher dollar cost.
  • Alternatives raised: Hetzner/OVH bare metal, DigitalOcean droplets, Backblaze B2 + Fastly, Cloudflare R2 (no egress fees), home-brewed setups.
  • Some emphasize opportunity cost: time spent building “cheap infra” vs building product features.

Performance & scale skepticism

  • Multiple commenters state 10k rps is not inherently high; 20–50k rps is feasible on modest hardware, especially for static content.
  • Others note that bandwidth, response size, and database limits (e.g., Postgres connection caps) can become bottlenecks before CPU.
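Two back-of-envelope formulas sit behind that point. The inputs below are hypothetical and chosen only for illustration. Bandwidth is request rate times response size. By Little's law, the average number of requests in flight (here, DB connections held) is request rate times average latency.

```python
def required_gbps(rps: float, avg_response_bytes: float) -> float:
    """Sustained bandwidth in Gbit/s (decimal) for a request rate and response size."""
    return rps * avg_response_bytes * 8 / 1e9


def concurrent_connections(rps: float, avg_latency_s: float) -> float:
    """Little's law: average requests in flight = arrival rate x average time in system."""
    return rps * avg_latency_s


# Hypothetical inputs: 100 kB responses, 5 ms per database query.
print(required_gbps(10_000, 100_000))
print(concurrent_connections(10_000, 0.005))
```

With these inputs, 10k rps needs about 8 Gbit/s, well beyond a 1 Gbit/s NIC. It also keeps about 50 connections busy on average, compared with PostgreSQL's default max_connections of 100, so bursts can hit the cap before CPU does.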

Security & robustness concerns

  • Warnings about open redirects and URL-parsing edge cases when accepting URLs from users to sign.
  • Recommendations: keep the bucket private, validate bucket names and object paths before signing, and don't blindly trust a language's URL parser.
  • General consensus: rate limiting and backoff should be designed in early at multiple layers, not bolted on after an incident.
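One way to follow that advice is to accept an object key from the client rather than a full URL, then check the key against an allow-list before signing, so no URL parser is involved. This is a minimal sketch. The bucket name, key prefixes, and permitted character set are hypothetical policy choices.

```python
import re
from typing import Optional

ALLOWED_BUCKET = "example-assets"            # hypothetical bucket name
ALLOWED_PREFIXES = ("covers/", "avatars/")   # hypothetical key prefixes
KEY_PATTERN = re.compile(r"[A-Za-z0-9/_.-]+")


def validate_object_key(bucket: str, key: str) -> Optional[str]:
    """Return the key if it is safe to sign, else None (allow-list, not parse-and-hope)."""
    if bucket != ALLOWED_BUCKET:
        return None
    # Rejects '?', '#', '%', ':', '@', backslashes, whitespace, and anything non-ASCII.
    if not KEY_PATTERN.fullmatch(key):
        return None
    # Rejects empty, '.' and '..' segments: '//', '../', and leading/trailing slashes.
    if any(seg in ("", ".", "..") for seg in key.split("/")):
        return None
    if not key.startswith(ALLOWED_PREFIXES):
        return None
    return key
```

Because anything that looks like a scheme, host, query, or traversal is refused outright, there is no parsing step for an attacker to confuse.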

Misc suggestions

  • Ideas include publishing periodic data snapshots to archive.org, using IP-level firewalls against abusive scrapers, and asking the cloud provider for billing relief after such incidents.