AI web crawlers are destroying websites in their never-ending hunger for content

CAPTCHAs and User Friction

  • Rising bot abuse is driving more sites to use CAPTCHAs, especially reCAPTCHA and Cloudflare challenges.
  • Many commenters now abandon CAPTCHA-heavy sites, sometimes turning to AI tools instead.
  • Tools like Anubis are seen as “less bad” than reCAPTCHA but are slow on low-end devices and can break some phones.

Scale and Nature of AI Bot Traffic

  • Reports of AI bots consuming orders of magnitude more resources than humans; one operator estimates only ~5% of traffic is real users.
  • Bots often ignore caching basics, robots.txt, or polite crawl rates, sometimes hitting dynamic or deep pages at ~1 request/second or worse.
  • Large crawlers increasingly spoof user agents and use huge IP pools (hundreds of thousands of IPs) to evade rate limiting and ASN blocks.
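By contrast, the conventions these crawlers skip are simple to state. A robots.txt like the following is all a polite crawler needs to honor (the bot name is a real published AI crawler agent, but the rules here are illustrative; Crawl-delay is a de facto extension rather than part of the original robots.txt standard):

```
User-agent: GPTBot
Disallow: /

User-agent: *
Crawl-delay: 10
Disallow: /search
```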

Impact on Small Sites and Hosting Costs

  • Hobby and mid-sized sites (forums, gaming resources, art galleries, roleplaying communities, railroading forums) describe traffic surges that effectively DDoS them.
  • One static gaming site faces ~30GB/day from a single crawler, threatening hundreds of dollars in overage fees. Others have been forced into login walls or paywalls.
  • WordPress-backed sites are especially vulnerable due to slow DB-heavy page generation and limited, fragile caching.
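The bandwidth figures above translate directly into money. A quick sketch of the arithmetic, using a hypothetical plan allowance and overage rate (the $/GB and included-transfer numbers are assumptions, not figures from the discussion):

```python
# Illustrative: monthly cost of a single crawler pulling ~30 GB/day.
GB_PER_DAY = 30
DAYS = 30
INCLUDED_GB = 100        # hypothetical monthly transfer allowance
OVERAGE_PER_GB = 0.50    # hypothetical overage rate in $/GB

total_gb = GB_PER_DAY * DAYS                 # 900 GB/month from one bot
overage_gb = max(0, total_gb - INCLUDED_GB)  # 800 GB over the allowance
cost = overage_gb * OVERAGE_PER_GB           # $400/month in overage fees
print(total_gb, overage_gb, cost)
```

Even with generous assumptions, a single misbehaving crawler can push a hobby site into hundreds of dollars of overage per month.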

Mitigation Tactics in Practice

  • Common approaches: blocking known AI user agents, nginx-level filters, rate limiting, fail2ban-style rules, ASN/IP blocklists, honeypots, and tools like Anubis.
  • These reduce abuse but create collateral damage for VPN users, non-Chrome browsers, accessibility tools, and privacy-focused clients.
  • Arms race dynamic: once blocked, sophisticated crawlers spread across more IPs, spoof user agents more convincingly, and slow their request patterns to blend in with real traffic.
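Several of these tactics can be combined at the web-server layer. A hypothetical nginx sketch (the user-agent strings are real published AI crawler agents, but this list, the rate limit, and the layout are illustrative, not a vetted blocklist):

```nginx
# http context: classify known AI crawler user agents (illustrative list)
map $http_user_agent $is_ai_bot {
    default        0;
    ~*GPTBot       1;
    ~*ClaudeBot    1;
    ~*CCBot        1;
    ~*Bytespider   1;
}

# http context: per-IP rate limit (2 req/s is an illustrative value)
limit_req_zone $binary_remote_addr zone=perip:10m rate=2r/s;

server {
    listen 80;

    # Refuse identified AI crawlers outright
    if ($is_ai_bot) {
        return 403;
    }

    location / {
        # Throttle everyone else; small bursts allowed
        limit_req zone=perip burst=10 nodelay;
        # ... root / proxy_pass as usual
    }
}
```

Note that this only stops crawlers that identify themselves; as the bullets above describe, the large operators that spoof agents and rotate IPs defeat both the map and the per-IP zone.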

Why Modern Crawlers Feel Worse Than Old Search Bots

  • Earlier search engines were fewer, resource-constrained, and generally honored robots.txt and modest recrawl frequencies.
  • AI companies are heavily capitalized, competing on freshness and coverage, and often treat crawl cost as negligible while externalizing bandwidth/CPU to site owners.
  • Some commenters claim AI training runs repeatedly re-scrape the web rather than reusing stored corpora.

Centralization, Ethics, and Proposed Structural Fixes

  • Many site owners feel driven toward centralized CDNs like Cloudflare simply to survive bot loads, despite worries about internet centralization and surveillance.
  • Proposed systemic fixes include:
    • Cryptographically signed “good bots” / agent identities.
    • Proof-of-work or micropayment gates per request.
    • Standardized low-cost APIs, RSS-like feeds, or WARC dumps for scrapers.
    • AI-targeted tarpits serving infinite or poisoned content.
  • Skeptics argue that abusive actors will ignore any norms, and that expecting small sites to build special feeds for AI is unfair.
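Of these, the proof-of-work idea is the easiest to make concrete, and it is the approach in the spirit of Anubis: the server issues a random challenge, and the client must find a nonce whose hash has enough leading zero bits before content is served. A minimal sketch in Python (the difficulty, encoding, and function names are assumptions for illustration, not Anubis's actual scheme):

```python
import hashlib
import os

DIFFICULTY_BITS = 12  # leading zero bits required; illustrative difficulty

def leading_zero_bits(digest: bytes) -> int:
    """Count leading zero bits of a byte string."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        for i in range(7, -1, -1):  # count zeros within the first nonzero byte
            if byte >> i:
                break
            bits += 1
        return bits
    return bits

def solve(challenge: bytes) -> int:
    """Client side: brute-force a nonce meeting the difficulty target."""
    nonce = 0
    while True:
        digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
        if leading_zero_bits(digest) >= DIFFICULTY_BITS:
            return nonce
        nonce += 1

def verify(challenge: bytes, nonce: int) -> bool:
    """Server side: one hash to check what took the client thousands."""
    digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
    return leading_zero_bits(digest) >= DIFFICULTY_BITS

challenge = os.urandom(16)
nonce = solve(challenge)
assert verify(challenge, nonce)
```

The asymmetry is the point: verification costs the server one hash, while solving costs the client thousands on average, which is cheap for one human page view but expensive at crawler scale. It also explains the collateral damage noted earlier: the same work that is negligible on a desktop can be slow on low-end phones.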

Broader Sentiment

  • Strong resentment toward AI companies: viewed as unethical “milkshake drinkers” extracting value without compensation and destabilizing the open web.
  • Some foresee continued contraction of the public web into walled gardens, paywalls, and CDNs unless crawler behavior changes.