AI web crawlers are destroying websites in their never-ending hunger for content

CAPTCHAs and User Friction

  • Rising bot abuse is driving more sites to use CAPTCHAs, especially reCAPTCHA and Cloudflare challenges.
  • Many commenters now abandon CAPTCHA-heavy sites, sometimes turning to AI tools instead.
  • Tools like Anubis are seen as “less bad” than reCAPTCHA but are slow on low-end devices and can break some phones.

Scale and Nature of AI Bot Traffic

  • Reports of AI bots consuming orders of magnitude more resources than humans; one operator estimates only ~5% of traffic is real users.
  • Bots often ignore caching basics, robots.txt, or polite crawl rates, sometimes hitting dynamic or deep pages at ~1 request/second or worse.
  • Large crawlers increasingly spoof user agents and use huge IP pools (hundreds of thousands of IPs) to evade rate limiting and ASN blocks.
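By contrast, the conventions these crawlers skip are simple to state. A robots.txt like the following is all a polite crawler needs to honor (the bot name is a real published AI crawler agent, but the rules here are illustrative; Crawl-delay is a de facto extension rather than part of the original robots.txt standard):

```
User-agent: GPTBot
Disallow: /

User-agent: *
Crawl-delay: 10
Disallow: /search
```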

Impact on Small Sites and Hosting Costs

  • Hobby and mid-sized sites (forums, gaming resources, art galleries, roleplaying communities, railroading forums) describe traffic surges that effectively DDoS them.
  • One static gaming site faces ~30GB/day from a single crawler, threatening hundreds of dollars in overage fees. Others have been forced into login walls or paywalls.
  • WordPress-backed sites are especially vulnerable due to slow DB-heavy page generation and limited, fragile caching.
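The bandwidth figures above translate directly into money. A quick sketch of the arithmetic, using a hypothetical plan allowance and overage rate (the $/GB and included-transfer numbers are assumptions, not figures from the discussion):

```python
# Illustrative: monthly cost of a single crawler pulling ~30 GB/day.
GB_PER_DAY = 30
DAYS = 30
INCLUDED_GB = 100        # hypothetical monthly transfer allowance
OVERAGE_PER_GB = 0.50    # hypothetical overage rate in $/GB

total_gb = GB_PER_DAY * DAYS                 # 900 GB/month from one bot
overage_gb = max(0, total_gb - INCLUDED_GB)  # 800 GB over the allowance
cost = overage_gb * OVERAGE_PER_GB           # $400/month in overage fees
print(total_gb, overage_gb, cost)
```

Even with generous assumptions, a single misbehaving crawler can push a hobby site into hundreds of dollars of overage per month.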

Mitigation Tactics in Practice

  • Common approaches: blocking known AI user agents, nginx-level filters, rate limiting, fail2ban-style rules, ASN/IP blocklists, honeypots, and tools like Anubis.
  • These reduce abuse but create collateral damage for VPN users, non-Chrome browsers, accessibility tools, and privacy-focused clients.
  • Arms race dynamic: once blocked, sophisticated crawlers spread across more IPs, spoof user agents more convincingly, and slow their request patterns to blend in with real traffic.
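Several of these tactics can be combined at the web-server layer. A hypothetical nginx sketch (the user-agent strings are real published AI crawler agents, but this list, the rate limit, and the layout are illustrative, not a vetted blocklist):

```nginx
# http context: classify known AI crawler user agents (illustrative list)
map $http_user_agent $is_ai_bot {
    default        0;
    ~*GPTBot       1;
    ~*ClaudeBot    1;
    ~*CCBot        1;
    ~*Bytespider   1;
}

# http context: per-IP rate limit (2 req/s is an illustrative value)
limit_req_zone $binary_remote_addr zone=perip:10m rate=2r/s;

server {
    listen 80;

    # Refuse identified AI crawlers outright
    if ($is_ai_bot) {
        return 403;
    }

    location / {
        # Throttle everyone else; small bursts allowed
        limit_req zone=perip burst=10 nodelay;
        # ... root / proxy_pass as usual
    }
}
```

Note that this only stops crawlers that identify themselves; as the bullets above describe, the large operators that spoof agents and rotate IPs defeat both the map and the per-IP zone.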

Why Modern Crawlers Feel Worse Than Old Search Bots

  • Earlier search engines were fewer, resource-constrained, and generally honored robots.txt and modest recrawl frequencies.
  • AI companies are heavily capitalized, competing on freshness and coverage, and often treat crawl cost as negligible while externalizing bandwidth/CPU to site owners.
  • Some commenters claim AI training runs repeatedly re-scrape the web rather than reusing stored corpora.

Centralization, Ethics, and Proposed Structural Fixes

  • Many site owners feel driven toward centralized CDNs like Cloudflare simply to survive bot loads, despite worries about internet centralization and surveillance.
  • Proposed systemic fixes include:
    • Cryptographically signed “good bots” / agent identities.
    • Proof-of-work or micropayment gates per request.
    • Standardized low-cost APIs, RSS-like feeds, or WARC dumps for scrapers.
    • AI-targeted tarpits serving infinite or poisoned content.
  • Skeptics argue that abusive actors will ignore any norms, and that expecting small sites to build special feeds for AI is unfair.
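Of these, the proof-of-work idea is the easiest to make concrete, and it is the approach in the spirit of Anubis: the server issues a random challenge, and the client must find a nonce whose hash has enough leading zero bits before content is served. A minimal sketch in Python (the difficulty, encoding, and function names are assumptions for illustration, not Anubis's actual scheme):

```python
import hashlib
import os

DIFFICULTY_BITS = 12  # leading zero bits required; illustrative difficulty

def leading_zero_bits(digest: bytes) -> int:
    """Count leading zero bits of a byte string."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        for i in range(7, -1, -1):  # count zeros within the first nonzero byte
            if byte >> i:
                break
            bits += 1
        return bits
    return bits

def solve(challenge: bytes) -> int:
    """Client side: brute-force a nonce meeting the difficulty target."""
    nonce = 0
    while True:
        digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
        if leading_zero_bits(digest) >= DIFFICULTY_BITS:
            return nonce
        nonce += 1

def verify(challenge: bytes, nonce: int) -> bool:
    """Server side: one hash to check what took the client thousands."""
    digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
    return leading_zero_bits(digest) >= DIFFICULTY_BITS

challenge = os.urandom(16)
nonce = solve(challenge)
assert verify(challenge, nonce)
```

The asymmetry is the point: verification costs the server one hash, while solving costs the client thousands on average, which is cheap for one human page view but expensive at crawler scale. It also explains the collateral damage noted earlier: the same work that is negligible on a desktop can be slow on low-end phones.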

Broader Sentiment

  • Strong resentment toward AI companies: viewed as unethical “milkshake drinkers” extracting value without compensation and destabilizing the open web.
  • Some foresee continued contraction of the public web into walled gardens, paywalls, and CDNs unless crawler behavior changes.