AI web crawlers are destroying websites in their never-ending hunger for content
CAPTCHAs and User Friction
- Rising bot abuse is driving more sites to use CAPTCHAs, especially reCAPTCHA and Cloudflare challenges.
- Many commenters report abandoning CAPTCHA-heavy sites, sometimes turning to AI tools to get the same information instead.
- Tools like Anubis are seen as “less bad” than reCAPTCHA, but their proof-of-work challenges run slowly on low-end devices and reportedly fail outright on some phones (the underlying mechanism is sketched after this list).
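For context, Anubis-style challenges work by making the browser burn CPU before a page is served: the server hands out a random challenge, the client brute-forces a nonce whose hash clears a difficulty threshold, and the server checks the answer with a single hash. The sketch below is a generic Python illustration of that idea, not Anubis's actual implementation (which runs as a reverse proxy and issues the challenge in JavaScript); the difficulty setting and challenge format here are invented.

```python
import hashlib
import os
import time

DIFFICULTY_BITS = 16  # hypothetical difficulty; real deployments tune this

def issue_challenge() -> str:
    """Server side: hand the client a random challenge string."""
    return os.urandom(16).hex()

def solve(challenge: str) -> int:
    """Client side: brute-force a nonce whose SHA-256 hash has the
    required number of leading zero bits. This loop is the CPU burn
    that slow devices feel."""
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if int.from_bytes(digest, "big") >> (256 - DIFFICULTY_BITS) == 0:
            return nonce
        nonce += 1

def verify(challenge: str, nonce: int) -> bool:
    """Server side: a single hash to check the client's work."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") >> (256 - DIFFICULTY_BITS) == 0

challenge = issue_challenge()
start = time.time()
nonce = solve(challenge)
print(f"solved in {time.time() - start:.2f}s; verified: {verify(challenge, nonce)}")
```

The asymmetry is the design point: verification costs one hash while solving costs tens of thousands on average, which is also why low-end phones feel it most.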
Scale and Nature of AI Bot Traffic
- Commenters report AI bots consuming orders of magnitude more resources than human visitors; one operator estimates only ~5% of their traffic comes from real users.
- Bots often ignore caching headers, robots.txt, and polite crawl rates, sometimes hammering dynamic or deep pages at ~1 request/second or worse (what polite crawling looks like is sketched after this list).
- Large crawlers increasingly spoof user agents and rotate through huge IP pools (hundreds of thousands of addresses) to evade rate limiting and ASN-level blocks.
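By contrast, the “polite” behavior these crawlers skip is not hard to implement. A minimal sketch using only Python's standard library: fetch robots.txt once, respect its Allow/Disallow rules, and honor any declared Crawl-delay. The user agent, site, and paths below are placeholders.

```python
import time
import urllib.error
import urllib.robotparser
import urllib.request

USER_AGENT = "ExampleBot/1.0"   # placeholder crawler identity
SITE = "https://example.com"    # placeholder site

# Fetch and parse robots.txt once, not on every request.
rp = urllib.robotparser.RobotFileParser()
rp.set_url(f"{SITE}/robots.txt")
rp.read()

# Honor a declared Crawl-delay; default to a modest pause if none.
delay = rp.crawl_delay(USER_AGENT) or 10

for path in ["/", "/about", "/archive"]:
    url = SITE + path
    if not rp.can_fetch(USER_AGENT, url):
        continue  # disallowed by robots.txt: skip, don't spoof around it
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    try:
        with urllib.request.urlopen(request) as response:
            response.read()
    except urllib.error.HTTPError:
        pass  # placeholder paths may not exist; a real crawler logs this
    time.sleep(delay)  # far gentler than the ~1 req/s sites complain about
```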
Impact on Small Sites and Hosting Costs
- Hobby and mid-sized sites (forums, gaming resources, art galleries, roleplaying communities, railroading forums) describe traffic surges that effectively DDoS them.
- One static gaming site faces ~30 GB/day from a single crawler, threatening hundreds of dollars in overage fees (rough cost math after this list); others have been forced behind login walls or paywalls.
- WordPress-backed sites are especially vulnerable due to slow DB-heavy page generation and limited, fragile caching.
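To put the overage numbers in perspective, here is the rough arithmetic, assuming a hypothetical $0.10/GB overage rate (actual rates vary widely by host):

```python
gb_per_day = 30      # the reported single-crawler load
rate_per_gb = 0.10   # hypothetical overage price; varies widely by host
monthly_gb = gb_per_day * 30
print(f"~{monthly_gb} GB/month -> ~${monthly_gb * rate_per_gb:.0f}/month in overage")
# ~900 GB/month -> ~$90/month, i.e. hundreds of dollars over a few months
```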
Mitigation Tactics in Practice
- Common approaches: blocking known AI user agents, nginx-level filters, rate limiting, fail2ban-style rules, ASN/IP blocklists, honeypots, and tools like Anubis (a minimal application-layer sketch follows this list).
- These reduce abuse but create collateral damage for VPN users, non-Chrome browsers, accessibility tools, and privacy-focused clients.
- Arms-race dynamic: once blocked, sophisticated crawlers spread across more IPs, spoof user agents more convincingly, and slow their request patterns to blend in with human traffic.
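Most sites implement these filters in nginx or fail2ban config rather than application code, but the logic is easy to show in one place. A minimal sketch combining the first two tactics: a user-agent substring blocklist plus a per-IP token-bucket rate limiter. The blocklist entries and limits here are illustrative, not a vetted list.

```python
import time

# Illustrative UA fragments only; real deployments use maintained
# lists (e.g. the ai.robots.txt project) or their CDN's bot categories.
BLOCKED_UA_SUBSTRINGS = ["GPTBot", "CCBot", "Bytespider"]

RATE = 1.0   # tokens refilled per second (~1 request/s sustained)
BURST = 5.0  # short bursts tolerated

_buckets: dict[str, tuple[float, float]] = {}  # ip -> (tokens, last_seen)

def allow(ip: str, user_agent: str) -> bool:
    """Decide whether to serve a request: UA blocklist first,
    then a per-IP token bucket."""
    if any(fragment in user_agent for fragment in BLOCKED_UA_SUBSTRINGS):
        return False  # self-identified AI crawler: reject outright

    now = time.monotonic()
    tokens, last = _buckets.get(ip, (BURST, now))
    tokens = min(BURST, tokens + (now - last) * RATE)  # refill since last seen
    if tokens < 1.0:
        _buckets[ip] = (tokens, now)
        return False  # over the per-IP rate limit
    _buckets[ip] = (tokens - 1.0, now)
    return True
```

Note how the arms race defeats both halves: spoofed user agents slip past the blocklist, and a pool of hundreds of thousands of IPs keeps every bucket full.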
Why Modern Crawlers Feel Worse Than Old Search Bots
- Earlier search engines were fewer in number, resource-constrained, and generally honored robots.txt while keeping recrawl frequencies modest.
- AI companies are heavily capitalized, competing on freshness and coverage, and often treat crawl cost as negligible while externalizing the bandwidth and CPU costs to site owners.
- Some commenters claim AI training runs repeatedly re-scrape the web rather than reusing stored corpora.
Centralization, Ethics, and Proposed Structural Fixes
- Many site owners feel driven toward centralized CDNs like Cloudflare simply to survive bot loads, despite worries about internet centralization and surveillance.
- Proposed systemic fixes include:
- Cryptographically signed “good bots” / verifiable agent identities (sketched after this list).
- Proof-of-work or micropayment gates per request (the Anubis-style sketch above shows the proof-of-work mechanism).
- Standardized low-cost APIs, RSS-like feeds, or WARC dumps for scrapers.
- AI-targeted tarpits serving infinite or poisoned content.
- Skeptics argue that abusive actors will ignore any norms, and that expecting small sites to build special feeds for AI is unfair.
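For the signed-bots idea, the core mechanism is that a crawler signs each request with a private key and publishes the matching public key somewhere verifiable, so its identity cannot be spoofed the way a User-Agent string can. A minimal sketch using the third-party `cryptography` package; what gets signed and how keys are discovered are invented here, and real proposals (e.g. HTTP message signatures) are considerably more involved.

```python
# Requires the third-party package: pip install cryptography
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import (
    Ed25519PrivateKey,
    Ed25519PublicKey,
)

# Crawler side: keypair generated once; the public key would be
# published somewhere verifiable (key discovery is hand-waved here).
private_key = Ed25519PrivateKey.generate()
public_key = private_key.public_key()

def sign_request(method: str, path: str) -> bytes:
    """Sign a canonical string for the request (hypothetical scheme)."""
    return private_key.sign(f"{method} {path}".encode())

# Site side: verify the signature against the claimed identity's key.
def is_good_bot(method: str, path: str, signature: bytes,
                claimed_key: Ed25519PublicKey) -> bool:
    try:
        claimed_key.verify(signature, f"{method} {path}".encode())
        return True
    except InvalidSignature:
        return False

signature = sign_request("GET", "/archive/page1")
assert is_good_bot("GET", "/archive/page1", signature, public_key)
assert not is_good_bot("GET", "/admin", signature, public_key)  # sig doesn't transfer
```

Verification is cheap; the hard parts are key discovery and participation, which is exactly the skeptics' objection: an abusive crawler simply won't sign.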
Broader Sentiment
- Strong resentment toward AI companies: viewed as unethical “milkshake drinkers” extracting value without compensation and destabilizing the open web.
- Some foresee continued contraction of the public web into walled gardens, paywalls, and CDNs unless crawler behavior changes.