The Cost of Being Crawled: LLM Bots and Vercel Image API Pricing
AI crawlers: block vs. cooperate
- Many commenters advocate outright blocking LLM/AI crawlers, calling them “leeches” that resell content without fair attribution or traffic back.
- Others propose serving minimal, machine‑readable content (e.g., markdown, plain text, or special `llm.txt` endpoints) to reduce bandwidth while still informing models and supporting future “generative engine optimization”; see the sketch after this list.
- Skeptics doubt AI bots will respect any new conventions if they already ignore `robots.txt`, and argue that LLM-based search by design reduces visits to source sites.
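As a rough illustration of the “serve bots something cheaper” idea, the sketch below routes requests from self-identified LLM crawlers to a plain markdown rendering instead of the full HTML page. It assumes a bare Node HTTP handler; the user‑agent substrings and the `./content/page.md` path are illustrative, not an authoritative registry of crawlers.

```typescript
// Serve a lightweight markdown variant to known LLM crawlers and the full
// HTML page to everyone else.
import { createServer } from "node:http";
import { readFile } from "node:fs/promises";

// Illustrative user-agent substrings; maintain your own list in practice.
const LLM_CRAWLER_HINTS = ["GPTBot", "ClaudeBot", "CCBot", "PerplexityBot"];

function looksLikeLlmCrawler(userAgent: string): boolean {
  return LLM_CRAWLER_HINTS.some((hint) => userAgent.includes(hint));
}

const server = createServer(async (req, res) => {
  if (looksLikeLlmCrawler(req.headers["user-agent"] ?? "")) {
    // Cheap, cacheable markdown rendering of the same content (hypothetical path).
    const body = await readFile("./content/page.md", "utf8");
    res.writeHead(200, { "content-type": "text/markdown; charset=utf-8" });
    res.end(body);
    return;
  }
  // Regular visitors get the full HTML experience (placeholder body here).
  res.writeHead(200, { "content-type": "text/html; charset=utf-8" });
  res.end("<html><body>Full page with images, scripts, etc.</body></html>");
});

server.listen(8080);
```

Content negotiation like this only reduces bandwidth for crawlers that identify themselves; it does nothing against bots that forge their user agent, a failure mode discussed in the next section.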
Robots.txt, bot identity, and bad behavior
- Multiple reports that AI crawlers:
  - Ignore `robots.txt` and `Crawl-Delay` (an example of both directives follows this list).
  - Hammer sites with huge spikes, retrying on errors and effectively causing partial DoS.
  - Forge or rotate user agents (including ChatGPT and browsers) and use varied IP ranges (cloud and residential).
- Some see “verified bot” allow-lists as entrenching incumbents: big bots that already extracted data get whitelisted and new entrants are blocked.
- There’s criticism that the affected app itself crawls podcast feeds/images and may not honor `robots.txt`, though the author argues this is standard practice in the podcast ecosystem, where hosts are designed for heavy RSS traffic.
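For reference, the kind of `robots.txt` the “cooperate” camp has in mind looks roughly like the excerpt below. The user‑agent tokens and paths are illustrative, and, as commenters stress, compliance is entirely voluntary; `Crawl-delay` in particular is a non‑standard directive that several major crawlers ignore.

```text
# Illustrative robots.txt excerpt
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# Everyone else: slow down and stay out of private areas.
User-agent: *
Crawl-delay: 10
Disallow: /private/
```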
Vercel image pricing, spend limits, and alternatives
- Many consider the original Vercel Image API pricing (e.g., ~$5/1,000 optimizations) “insanely expensive,” especially given how cheaply image resizing can be done with tools like ImageMagick, Thumbor, imgproxy, or a low-cost VPS/CDN combo (e.g., BunnyCDN); a self-hosted resizing sketch follows this list.
- Vercel staff note they now use cheaper transformation-based pricing and offer soft/hard spend limits (“spend management”), though the UX confused the affected team.
- Some see this as an example of “PaaS/ignorance tax” and vendor lock‑in: attractive free tiers, then sharp costs once real traffic arrives.
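To make the “resizing is cheap to self‑host” argument concrete, here is a minimal sketch of a resize endpoint using the `sharp` library behind long‑lived cache headers so a CDN absorbs repeat traffic. The allow‑listed widths, query parameters, and `./public/images` directory are assumptions for illustration, not a drop‑in replacement for Vercel’s API.

```typescript
// Minimal self-hosted image resize endpoint (npm i sharp).
import { createServer } from "node:http";
import { readFile } from "node:fs/promises";
import path from "node:path";
import sharp from "sharp";

const ALLOWED_WIDTHS = [320, 640, 1280];   // prevents arbitrary-size abuse
const IMAGE_ROOT = "./public/images";      // hypothetical source directory

const server = createServer(async (req, res) => {
  const url = new URL(req.url ?? "/", "http://localhost");
  const width = Number(url.searchParams.get("w"));
  const name = path.basename(url.searchParams.get("src") ?? ""); // blocks path traversal

  if (!ALLOWED_WIDTHS.includes(width) || !name) {
    res.writeHead(400).end("bad request");
    return;
  }

  try {
    const original = await readFile(path.join(IMAGE_ROOT, name));
    const resized = await sharp(original)
      .resize({ width, withoutEnlargement: true })
      .webp({ quality: 75 })
      .toBuffer();
    // Long-lived cache headers so a CDN in front absorbs repeat requests.
    res.writeHead(200, {
      "content-type": "image/webp",
      "cache-control": "public, max-age=31536000, immutable",
    });
    res.end(resized);
  } catch {
    res.writeHead(404).end("not found");
  }
});

server.listen(3000);
```

Restricting output to a small set of widths matters here: without it, a crawler (or an attacker) can request arbitrary dimensions and bypass the CDN cache entirely.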
Infrastructure, caching, and mitigation strategies
- Commenters emphasize that a modest VPS plus good caching (nginx/Varnish/Cloudflare/CloudFront) can handle large volumes and bots cheaply.
- Experiences differ: some say well‑tuned platforms easily absorb multiple crawlers; others describe AI bots overwhelming sites even behind CDNs.
- Suggestions include stricter rate limiting (a minimal sketch follows this list), IP-range blocking, challenging unidentified bots, offloading more work to clients, and treating the marketing site as a first‑class, performance‑critical component.
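As a sketch of the rate‑limiting suggestion, the per‑IP token bucket below is the simplest version of the idea; the thresholds are illustrative, and in practice this usually lives at the proxy or CDN layer (nginx `limit_req`, Varnish, Cloudflare rules) rather than in application code.

```typescript
// Minimal per-IP token-bucket rate limiter; thresholds are illustrative.
const CAPACITY = 30;          // max burst per IP
const REFILL_PER_SECOND = 5;  // sustained requests per second per IP

interface Bucket {
  tokens: number;
  lastRefill: number; // ms timestamp
}

const buckets = new Map<string, Bucket>();

export function allowRequest(ip: string, now: number = Date.now()): boolean {
  const bucket = buckets.get(ip) ?? { tokens: CAPACITY, lastRefill: now };

  // Refill proportionally to elapsed time, capped at CAPACITY.
  const elapsedSeconds = (now - bucket.lastRefill) / 1000;
  bucket.tokens = Math.min(CAPACITY, bucket.tokens + elapsedSeconds * REFILL_PER_SECOND);
  bucket.lastRefill = now;

  if (bucket.tokens < 1) {
    buckets.set(ip, bucket);
    return false; // caller should respond 429 Too Many Requests
  }
  bucket.tokens -= 1;
  buckets.set(ip, bucket);
  return true;
}

// Usage (e.g., inside an HTTP handler):
//   if (!allowRequest(req.socket.remoteAddress ?? "unknown")) {
//     res.writeHead(429, { "retry-after": "10" }).end("rate limited");
//     return;
//   }
```

An in‑memory map like this is per‑process only; behind multiple instances the counters would need a shared store, which is one reason commenters favor doing this at the CDN or reverse proxy instead.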