AI crawlers need to be more respectful

Scale and Impact of AI Crawlers

  • Multiple operators report AI crawlers generating far more load than all search engines and human visitors combined.
  • Example from the article: a single buggy crawler pulled tens of terabytes in a month, costing thousands in bandwidth.
  • Some operators see two or three AI crawlers consuming the majority of their traffic; others argue that, given how many crawlers exist globally, it is not surprising that “only a few bad ones” misbehave, though those few are still costly.

Comparisons with Traditional Search Engines

  • Many distinguish between old search crawlers and AI crawlers: search used to send traffic back; AI and modern search “answer pages” can extract value without referrals.
  • Googlebot is described as comparatively “well-behaved” but imperfect in how it handles 429/503 responses and the Retry-After header.
  • Non-Western and some commercial crawlers are criticized for high crawl rates with little or no referral traffic.
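
The 429/503 and Retry-After point can be made concrete from the crawler's side. The following is a minimal sketch, not any real crawler's logic: the function name retry_delay and the 60-second fallback are assumptions chosen for illustration. It shows how a crawler might turn a throttling response into a wait time, accepting both forms of Retry-After that HTTP allows (a number of seconds or an HTTP date).

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional

# Assumed fallback when the server throttles without saying for how long.
DEFAULT_BACKOFF_SECONDS = 60.0


def retry_delay(status: int, retry_after: Optional[str],
                now: Optional[datetime] = None) -> float:
    """Return how many seconds to wait before retrying (0.0 = no backoff)."""
    if status not in (429, 503):
        return 0.0
    if retry_after is None:
        return DEFAULT_BACKOFF_SECONDS

    value = retry_after.strip()
    if value.isdigit():
        # Form 1: delay in seconds, e.g. "Retry-After: 120"
        return float(value)

    try:
        # Form 2: HTTP date, e.g. "Retry-After: Wed, 21 Oct 2015 07:28:00 GMT"
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return DEFAULT_BACKOFF_SECONDS
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)

    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

A crawler would sleep for the returned number of seconds before requesting anything else from that host, rather than retrying immediately.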

Mitigation Strategies and Their Limits

  • Common defenses: IP-based rate limiting, CAPTCHAs, fail2ban, spider traps, “infinite garbage” pages, honeypot services, and aggressive IP blocking (including whole cloud-provider ranges or even countries).
  • Others argue these measures hurt real users (e.g., those on shared IPs, old user agents, mobile CGNAT, or Tor) and are a poor fit for public-information sites.
  • One suggestion is to rate-limit non-browser user agents; the counterpoint is that bots spoof modern browser UAs.
  • Distributed crawlers from many cloud IPs bypass simple per-IP rate limits.
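
To illustrate why per-IP limits are easy to sidestep and what a slightly coarser limit looks like, here is a minimal sketch of a token-bucket limiter keyed by network prefix instead of by single address. The class name, the /24 and /48 prefix lengths, and the rate values are all assumptions for illustration, not settings recommended in the discussion.

```python
import ipaddress
import time


class SubnetRateLimiter:
    """Token bucket shared by every address in the same network prefix."""

    def __init__(self, rate: float, burst: float,
                 v4_prefix: int = 24, v6_prefix: int = 48,
                 clock=time.monotonic):
        self.rate = rate            # tokens refilled per second
        self.burst = burst          # maximum bucket size
        self.v4_prefix = v4_prefix  # assumed grouping for IPv4
        self.v6_prefix = v6_prefix  # assumed grouping for IPv6
        self.clock = clock
        self.buckets = {}           # prefix -> (tokens, last_seen)

    def _key(self, ip: str) -> str:
        addr = ipaddress.ip_address(ip)
        prefix = self.v4_prefix if addr.version == 4 else self.v6_prefix
        # strict=False masks off the host bits: 203.0.113.7 -> 203.0.113.0/24
        return str(ipaddress.ip_network(f"{addr}/{prefix}", strict=False))

    def allow(self, ip: str) -> bool:
        now = self.clock()
        key = self._key(ip)
        tokens, last = self.buckets.get(key, (self.burst, now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self.buckets[key] = (tokens - 1.0, now)
            return True
        self.buckets[key] = (tokens, now)
        return False
```

Grouping by prefix catches a crawler rotating through neighbouring addresses, but it sharpens the objection raised above: everyone behind the same CGNAT or shared range now shares one bucket. A crawler spread across many unrelated cloud ranges still gets through.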

Hosting Costs and Infrastructure Choices

  • Several commenters say the real problem is expensive bandwidth on big clouds; others counter that documentation/text sites shouldn’t need heavy infra until bots appear.
  • Alternatives suggested: cheaper EU hosts, dedicated fiber, unmetered racks, better CDN integration.

Legal and Policy Debates

  • Debate over whether abusive crawling is “theft of service” or merely a ToS issue, and whether ToS apply only where the crawler has explicitly agreed to them (login-gated content vs public pages).
  • Some call for lawsuits, fines, or invoicing abusive crawlers; others doubt cross-border enforceability.
  • robots.txt is seen as a social norm, not a strong legal instrument.
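
Since robots.txt only works when the crawler chooses to honor it, it helps to see what honoring it involves. The sketch below uses Python's standard urllib.robotparser; the bot name ExampleAIBot, the example.org URLs, and the rules themselves are hypothetical.

```python
from typing import Optional, Tuple
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one AI crawler entirely, slow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Crawl-delay: 10
Disallow: /private/
"""


def fetch_policy(robots_txt: str, user_agent: str,
                 url: str) -> Tuple[bool, Optional[int]]:
    """Return (may_fetch, crawl_delay_seconds) for this agent and URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url), parser.crawl_delay(user_agent)


if __name__ == "__main__":
    for agent in ("ExampleAIBot", "SomeOtherBot"):
        print(agent, fetch_policy(ROBOTS_TXT, agent, "https://example.org/docs/"))
```

Nothing in this check is enforced by the server. A crawler that skips it, or that sends a different user agent, sees no difference, which is why commenters treat robots.txt as a norm and not a control.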

Broader Concerns About the Web’s Future

  • Many see AI data-scraping as a race-to-the-bottom “tragedy of the commons,” accelerating paywalls and enclosure of useful content.
  • Some call for standardized, rate-limited machine-readable feeds/APIs and even regulatory standards enforced via CDNs/ISPs.
  • Others are pessimistic: as long as users get convenience and dopamine, they’ll tolerate exploitative crawling and centralization.