2024-07-25

AI crawlers need to be more respectful

Scale and Impact of AI Crawlers

Multiple operators report that AI crawlers generate far more load than all search engines and human visitors combined.
Example from the article: a single buggy crawler pulled tens of terabytes in a month, costing thousands in bandwidth charges.
Some see two or three AI crawlers consuming the majority of their traffic. Others argue that, relative to all crawlers globally, “only a few bad ones” misbehaving is not surprising, though it is still costly.

Comparisons with Traditional Search Engines

Many distinguish between old search crawlers and AI crawlers: search used to send traffic back; AI and modern search “answer pages” can extract value without referrals.
Googlebot is described as comparatively “well-behaved” but imperfect in how it handles 429/503 responses and the Retry-After header.
Non-Western and some commercial crawlers are criticized for high crawl rates with little or no referral traffic.
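
The 429/503 and Retry-After point above can be made concrete with a small sketch of what a respectful crawler might do before retrying. This is an illustration under stated assumptions, not how any particular crawler works: the function name `retry_delay_seconds` and the 60-second fallback are hypothetical choices. Retry-After may carry either a number of seconds or an HTTP date, so both forms are handled.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_delay_seconds(status, retry_after, now=None, default=60.0):
    """Return how long to wait before retrying, or 0.0 if no backoff is needed.

    `default` is an assumed fallback used when the server sends 429/503
    without a usable Retry-After header.
    """
    if status not in (429, 503):
        return 0.0
    if retry_after is None:
        return default

    value = retry_after.strip()
    if value.isdigit():
        # Delta-seconds form, e.g. "Retry-After: 120"
        return float(value)

    try:
        # HTTP-date form, e.g. "Retry-After: Wed, 21 Oct 2015 07:28:00 GMT"
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default

    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    # A date in the past means the wait is already over.
    return max(0.0, (when - now).total_seconds())
```

A crawler loop would sleep for the returned number of seconds before requesting the same host again, instead of retrying immediately.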

Mitigation Strategies and Their Limits

Common defenses: IP-based rate limiting, CAPTCHAs, fail2ban, spider traps, “infinite garbage” pages, honeypot services, and aggressive IP blocking (including whole cloud-provider ranges or even countries).
Others argue this hurts real users (e.g., shared IPs, old user agents, mobile CGNAT, Tor) and is hard to apply on public-information sites.
One suggestion is to rate-limit non-browser user agents; the counterpoint is that bots spoof modern browser UAs.
Distributed crawlers from many cloud IPs bypass simple per-IP rate limits.
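
To illustrate why simple per-IP limits are bypassed by distributed crawlers, here is a minimal token-bucket sketch that counts requests per network prefix instead of per address. Grouping by /24 (IPv4) or /48 (IPv6) is an assumption made for this example, not something proposed in the discussion, and the class name `PrefixRateLimiter` is hypothetical. The trade-off raised above still applies: wider buckets also catch legitimate users who share a range, such as mobile CGNAT customers.

```python
import ipaddress


class PrefixRateLimiter:
    """Token bucket keyed by network prefix rather than by single IP."""

    def __init__(self, rate, burst, v4_prefix=24, v6_prefix=48):
        self.rate = float(rate)    # tokens added per second
        self.burst = float(burst)  # maximum bucket size
        self.v4_prefix = v4_prefix
        self.v6_prefix = v6_prefix
        self._buckets = {}         # prefix -> (tokens, last_timestamp)

    def _key(self, ip):
        addr = ipaddress.ip_address(ip)
        prefix = self.v4_prefix if addr.version == 4 else self.v6_prefix
        # strict=False masks the host bits, e.g. 203.0.113.5 -> 203.0.113.0/24
        return str(ipaddress.ip_network(f"{addr}/{prefix}", strict=False))

    def allow(self, ip, now):
        """Return True if the request is within the limit at time `now` (seconds)."""
        key = self._key(ip)
        tokens, last = self._buckets.get(key, (self.burst, now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self._buckets[key] = (tokens - 1.0, now)
            return True
        self._buckets[key] = (tokens, now)
        return False


limiter = PrefixRateLimiter(rate=1.0, burst=3)
# Five different addresses in the same /24 share one bucket.
results = [limiter.allow(f"203.0.113.{i}", now=0.0) for i in range(1, 6)]
```

With a per-IP key, all five requests would have been allowed because each address is seen only once. A crawler spread across many unrelated ranges would still evade this, which is the limitation the discussion points to.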

Hosting Costs and Infrastructure Choices

Several commenters say the real problem is expensive bandwidth on big clouds; others counter that documentation/text sites shouldn’t need heavy infra until bots appear.
Alternatives suggested: cheaper EU hosts, dedicated fiber, unmetered racks, better CDN integration.

Legal and Policy Debates

Debate over whether abusive crawling is “theft of service” or merely a ToS issue, and then only where the crawler has explicitly agreed to the terms (login-gated content vs public pages).
Some call for lawsuits, fines, or invoicing abusive crawlers; others doubt cross-border enforceability.
robots.txt is seen as a social norm, not a strong legal instrument.
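
Because robots.txt works only when the crawler chooses to honor it, it may help to see what that voluntary check looks like from the crawler's side. The sketch below uses Python's standard `urllib.robotparser`; the user-agent token `ExampleAIBot`, the sample rules, and the example.org URLs are hypothetical. Note that `Crawl-delay` is a non-standard extension that not every crawler respects.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one AI crawler entirely, slow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Crawl-delay: 10
Disallow: /private/
"""


def load_rules(text):
    """Parse robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# A compliant crawler asks before each fetch; nothing enforces the answer.
ai_bot_allowed = rules.can_fetch("ExampleAIBot", "https://example.org/docs/")
other_allowed = rules.can_fetch("SomeOtherBot", "https://example.org/docs/")
other_private = rules.can_fetch("SomeOtherBot", "https://example.org/private/x")
other_delay = rules.crawl_delay("SomeOtherBot")
```

Nothing in this flow stops a crawler that skips the check or sends a different user-agent string, which is why commenters treat the file as a norm and not as enforcement.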

Broader Concerns About the Web’s Future

Many see AI data-scraping as a race-to-the-bottom “tragedy of the commons,” accelerating paywalls and enclosure of useful content.
Some call for standardized, rate-limited machine-readable feeds/APIs and even regulatory standards enforced via CDNs/ISPs.
Others are pessimistic: as long as users get convenience and dopamine, they’ll tolerate exploitative crawling and centralization.
