AI crawlers need to be more respectful
Scale and Impact of AI Crawlers
- Multiple operators report AI crawlers generating far more load than all search engines and human visitors combined.
- Example from the article: tens of TBs in a month from a single buggy crawler, costing thousands in bandwidth.
- Some operators see two or three AI crawlers consuming the majority of their traffic; others argue that, given how many crawlers operate globally, it is not surprising that “only a few bad ones” misbehave, though the cost is still real.
Comparisons with Traditional Search Engines
- Many distinguish between old search crawlers and AI crawlers: search used to send traffic back; AI and modern search “answer pages” can extract value without referrals.
- Googlebot is described as comparatively “well-behaved” but imperfect around 429/503 handling and Retry-After.
- Non-Western and some commercial crawlers are criticized for high crawl rates with little or no referral traffic.
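
The 429/503 and Retry-After point can be made concrete with a minimal sketch of what a well-behaved crawler might do with those responses. This is an illustration, not any real crawler's implementation: the function name `retry_delay` and the 60-second fallback are assumptions. The two header forms it handles (a delay in seconds, or an HTTP-date) are the ones defined for Retry-After in the HTTP specification.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional

DEFAULT_BACKOFF_SECONDS = 60.0  # assumed fallback; not taken from the discussion


def retry_delay(status: int, retry_after: Optional[str],
                now: Optional[datetime] = None) -> float:
    """Return how many seconds a crawler should wait before retrying.

    Returns 0.0 for any status other than 429 and 503.
    """
    if status not in (429, 503):
        return 0.0
    if retry_after is None:
        return DEFAULT_BACKOFF_SECONDS
    value = retry_after.strip()
    # Form 1: a delay in whole seconds, e.g. "Retry-After: 120".
    if value.isdigit():
        return float(value)
    # Form 2: an HTTP-date, e.g. "Retry-After: Wed, 21 Oct 2015 07:28:00 GMT".
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Unparseable header: fall back rather than retrying immediately.
        return DEFAULT_BACKOFF_SECONDS
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

A crawler loop would sleep for the returned number of seconds before requesting the same host again. The complaint in the thread is about crawlers that skip this step, or handle it imperfectly.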
Mitigation Strategies and Their Limits
- Common defenses: IP-based rate limiting, CAPTCHAs, fail2ban, spider traps, “infinite garbage” pages, honeypot services, and aggressive IP blocking (including whole cloud-provider ranges or even countries).
- Others argue this hurts real users (e.g., those on shared IPs, old user agents, mobile CGNAT, or Tor) and is a poor fit for sites whose purpose is to serve public information.
- Suggestion to rate-limit non-browser user agents; counterpoint: bots spoof modern UAs.
- Distributed crawlers from many cloud IPs bypass simple per-IP rate limits.
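
As a rough illustration of the rate-limiting ideas above, here is a minimal token-bucket sketch that budgets requests per network prefix instead of per single IP. The class name `TokenBucketLimiter`, the helper `network_key`, and the /24 and /48 prefix lengths are all assumptions chosen for the example. Grouping by prefix only partly answers the distributed-crawler problem (a crawler spread across unrelated ranges still gets through), and it widens the collateral damage to shared-IP and CGNAT users that commenters warn about.

```python
import ipaddress
import time
from typing import Callable, Dict, Tuple


def network_key(ip: str, v4_prefix: int = 24, v6_prefix: int = 48) -> str:
    """Map an address to its enclosing network so neighbours share a budget."""
    addr = ipaddress.ip_address(ip)
    prefix = v4_prefix if addr.version == 4 else v6_prefix
    return str(ipaddress.ip_network(f"{addr}/{prefix}", strict=False))


class TokenBucketLimiter:
    """Token bucket per network: `rate` requests/second, bursts up to `burst`."""

    def __init__(self, rate: float, burst: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.burst = burst
        self.clock = clock
        # key -> (tokens remaining, time of last update)
        self._buckets: Dict[str, Tuple[float, float]] = {}

    def allow(self, ip: str) -> bool:
        """Return True if the request may proceed, False if it should get a 429."""
        key = network_key(ip)
        now = self.clock()
        tokens, last = self._buckets.get(key, (self.burst, now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0
        self._buckets[key] = (tokens, now)
        return allowed
```

A server would call `limiter.allow(client_ip)` once per request and answer with 429 (ideally with a Retry-After header) when it returns False. A production version would also need to evict idle buckets, which this sketch omits.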
Hosting Costs and Infrastructure Choices
- Several commenters say the real problem is expensive bandwidth on big clouds; others counter that documentation/text sites shouldn’t need heavy infra until bots appear.
- Alternatives suggested: cheaper EU hosts, dedicated fiber, unmetered racks, better CDN integration.
Legal and Policy Debates
- Debate over whether abusive crawling is “theft of service” or merely a ToS violation, which would arguably apply only where the crawler has explicitly agreed to the terms (login-gated content rather than public pages).
- Some call for lawsuits, fines, or invoicing abusive crawlers; others doubt cross-border enforceability.
- Robots.txt is seen as a social norm, not a strong legal instrument.
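
To show why robots.txt works as a norm and not as an enforcement mechanism, here is a small sketch of the check a cooperative crawler would run before fetching a page, using Python's standard `urllib.robotparser`. The robots.txt content and the bot name `ExampleAIBot` are hypothetical. Note that `Crawl-delay` is a non-standard extension that some crawlers ignore, and nothing in the protocol stops a crawler from skipping this check entirely.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration; "ExampleAIBot" is a made-up name.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Crawl-delay: 10
Disallow: /private/
"""


def load_policy(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a policy object a crawler can query."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


policy = load_policy(ROBOTS_TXT)

# A cooperative crawler asks before every fetch and honours the answer.
blocked_bot_ok = policy.can_fetch("ExampleAIBot", "https://example.org/docs/")
other_bot_ok = policy.can_fetch("SomeOtherBot", "https://example.org/docs/")
other_bot_delay = policy.crawl_delay("SomeOtherBot")  # seconds, or None if unset
```

Everything here runs on the crawler's side, which is the point made in the thread: the site operator can publish the policy but cannot compel anyone to read it.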
Broader Concerns About the Web’s Future
- Many see AI data-scraping as a race-to-the-bottom “tragedy of the commons,” accelerating paywalls and enclosure of useful content.
- Some call for standardized, rate-limited machine-readable feeds/APIs and even regulatory standards enforced via CDNs/ISPs.
- Others are pessimistic: as long as users get convenience and dopamine, they’ll tolerate exploitative crawling and centralization.