The Cost of Being Crawled: LLM Bots and Vercel Image API Pricing
AI crawlers: block vs. cooperate
- Many commenters advocate outright blocking LLM/AI crawlers, calling them “leeches” that resell content without fair attribution or traffic back.
- Others propose serving minimal, machine‑readable content (e.g., markdown, plain text, or special `llm.txt` endpoints) to reduce bandwidth while still informing models and supporting future “generative engine optimization”; see the sketch after this list.
- Skeptics doubt AI bots will respect any new conventions if they already ignore `robots.txt`, and argue that LLM-based search by design reduces visits to source sites.
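As a rough illustration of the “serve bots something cheaper” idea, the sketch below routes requests from self-identified LLM crawlers to a plain markdown rendering instead of the full HTML page. It assumes a bare Node HTTP handler; the user‑agent substrings and the `./content/page.md` path are illustrative, not an authoritative registry of crawlers.

```typescript
// Serve a lightweight markdown variant to known LLM crawlers and the full
// HTML page to everyone else.
import { createServer } from "node:http";
import { readFile } from "node:fs/promises";

// Illustrative user-agent substrings; maintain your own list in practice.
const LLM_CRAWLER_HINTS = ["GPTBot", "ClaudeBot", "CCBot", "PerplexityBot"];

function looksLikeLlmCrawler(userAgent: string): boolean {
  return LLM_CRAWLER_HINTS.some((hint) => userAgent.includes(hint));
}

const server = createServer(async (req, res) => {
  if (looksLikeLlmCrawler(req.headers["user-agent"] ?? "")) {
    // Cheap, cacheable markdown rendering of the same content (hypothetical path).
    const body = await readFile("./content/page.md", "utf8");
    res.writeHead(200, { "content-type": "text/markdown; charset=utf-8" });
    res.end(body);
    return;
  }
  // Regular visitors get the full HTML experience (placeholder body here).
  res.writeHead(200, { "content-type": "text/html; charset=utf-8" });
  res.end("<html><body>Full page with images, scripts, etc.</body></html>");
});

server.listen(8080);
```

Content negotiation like this only reduces bandwidth for crawlers that identify themselves; it does nothing against bots that forge their user agent, a failure mode discussed in the next section.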
Robots.txt, bot identity, and bad behavior
- Multiple reports that AI crawlers:
  - Ignore `robots.txt` and `Crawl-Delay` (an example of both directives follows this list).
  - Hammer sites with huge spikes, retrying on errors and effectively causing partial DoS.
  - Forge or rotate user agents (including ChatGPT and browsers) and use varied IP ranges (cloud and residential).
- Some see “verified bot” allow-lists as entrenching incumbents: big bots that already extracted data get whitelisted and new entrants are blocked.
- There’s criticism that the affected app itself crawls podcast feeds/images and may not honor `robots.txt`, though the author argues this is standard practice in the podcast ecosystem, where hosts are designed for heavy RSS traffic.
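For reference, the kind of `robots.txt` the “cooperate” camp has in mind looks roughly like the excerpt below. The user‑agent tokens and paths are illustrative, and, as commenters stress, compliance is entirely voluntary; `Crawl-delay` in particular is a non‑standard directive that several major crawlers ignore.

```text
# Illustrative robots.txt excerpt
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# Everyone else: slow down and stay out of private areas.
User-agent: *
Crawl-delay: 10
Disallow: /private/
```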
Vercel image pricing, spend limits, and alternatives
- Many consider the original Vercel Image API pricing (e.g., ~$5/1,000 optimizations) “insanely expensive,” especially given how cheaply image resizing can be done with tools like ImageMagick, Thumbor, imgproxy, or a low-cost VPS/CDN combo (e.g., BunnyCDN); a self-hosted resizing sketch follows this list.
- Vercel staff note they now use cheaper transformation-based pricing and offer soft/hard spend limits (“spend management”), though the UX confused the affected team.
- Some see this as an example of “PaaS/ignorance tax” and vendor lock‑in: attractive free tiers, then sharp costs once real traffic arrives.
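To make the “resizing is cheap to self‑host” argument concrete, here is a minimal sketch of a resize endpoint using the `sharp` library behind long‑lived cache headers so a CDN absorbs repeat traffic. The allow‑listed widths, query parameters, and `./public/images` directory are assumptions for illustration, not a drop‑in replacement for Vercel’s API.

```typescript
// Minimal self-hosted image resize endpoint (npm i sharp).
import { createServer } from "node:http";
import { readFile } from "node:fs/promises";
import path from "node:path";
import sharp from "sharp";

const ALLOWED_WIDTHS = [320, 640, 1280];   // prevents arbitrary-size abuse
const IMAGE_ROOT = "./public/images";      // hypothetical source directory

const server = createServer(async (req, res) => {
  const url = new URL(req.url ?? "/", "http://localhost");
  const width = Number(url.searchParams.get("w"));
  const name = path.basename(url.searchParams.get("src") ?? ""); // blocks path traversal

  if (!ALLOWED_WIDTHS.includes(width) || !name) {
    res.writeHead(400).end("bad request");
    return;
  }

  try {
    const original = await readFile(path.join(IMAGE_ROOT, name));
    const resized = await sharp(original)
      .resize({ width, withoutEnlargement: true })
      .webp({ quality: 75 })
      .toBuffer();
    // Long-lived cache headers so a CDN in front absorbs repeat requests.
    res.writeHead(200, {
      "content-type": "image/webp",
      "cache-control": "public, max-age=31536000, immutable",
    });
    res.end(resized);
  } catch {
    res.writeHead(404).end("not found");
  }
});

server.listen(3000);
```

Restricting output to a small set of widths matters here: without it, a crawler (or an attacker) can request arbitrary dimensions and bypass the CDN cache entirely.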
Infrastructure, caching, and mitigation strategies
- Commenters emphasize that a modest VPS plus good caching (nginx/Varnish/Cloudflare/CloudFront) can handle large volumes and bots cheaply.
- Experiences differ: some say well‑tuned platforms easily absorb multiple crawlers; others describe AI bots overwhelming sites even behind CDNs.
- Suggestions include stricter rate limiting (a minimal sketch follows this list), IP-range blocking, challenging unidentified bots, offloading more work to clients, and treating the marketing site as a first‑class, performance‑critical component.
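As a sketch of the rate‑limiting suggestion, the per‑IP token bucket below is the simplest version of the idea; the thresholds are illustrative, and in practice this usually lives at the proxy or CDN layer (nginx `limit_req`, Varnish, Cloudflare rules) rather than in application code.

```typescript
// Minimal per-IP token-bucket rate limiter; thresholds are illustrative.
const CAPACITY = 30;          // max burst per IP
const REFILL_PER_SECOND = 5;  // sustained requests per second per IP

interface Bucket {
  tokens: number;
  lastRefill: number; // ms timestamp
}

const buckets = new Map<string, Bucket>();

export function allowRequest(ip: string, now: number = Date.now()): boolean {
  const bucket = buckets.get(ip) ?? { tokens: CAPACITY, lastRefill: now };

  // Refill proportionally to elapsed time, capped at CAPACITY.
  const elapsedSeconds = (now - bucket.lastRefill) / 1000;
  bucket.tokens = Math.min(CAPACITY, bucket.tokens + elapsedSeconds * REFILL_PER_SECOND);
  bucket.lastRefill = now;

  if (bucket.tokens < 1) {
    buckets.set(ip, bucket);
    return false; // caller should respond 429 Too Many Requests
  }
  bucket.tokens -= 1;
  buckets.set(ip, bucket);
  return true;
}

// Usage (e.g., inside an HTTP handler):
//   if (!allowRequest(req.socket.remoteAddress ?? "unknown")) {
//     res.writeHead(429, { "retry-after": "10" }).end("rate limited");
//     return;
//   }
```

An in‑memory map like this is per‑process only; behind multiple instances the counters would need a shared store, which is one reason commenters favor doing this at the CDN or reverse proxy instead.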