Devs say AI crawlers dominate traffic, forcing blocks on entire countries

Scale and impact of AI crawling

  • Multiple operators of small and mid-sized sites report being overwhelmed: hundreds of thousands to millions of automated requests per day vs ~1,000 legitimate ones, sometimes forcing sites to shut down or retreat behind logins.
  • Specific anecdotes describe Claude/ChatGPT-style bots hammering sites: hundreds of thousands of hits per month, triggering bandwidth caps, and ignoring HTTP 429 responses and dropped connections.
  • Some see all major AI providers as “equally terrible,” with many bots spoofing big-company user agents (often Amazon) and coming from large cloud IP ranges or residential botnets.

Crude defenses: country/IP blocks and walled gardens

  • Country-level IP blocking is described as “lazy but pragmatic”: fine if you truly expect zero real users from a country, but dangerous for general services or international businesses.
  • Blocks rooted in historic or geopolitical grievances (e.g., blocking access from Israel) raise ethical concerns about collective punishment vs targeted accountability.
  • Many sites now restrict dynamic features to logged-in users, move behind Cloudflare, or fully auth-wall content that used to be public.
  • There’s nostalgia for the “old web” and a sense that AI scraping is accelerating its replacement by login walls, private networks, and “friends-and-family” intranets.

Technical mitigation ideas

  • Rate limiting is hard when crawlers rotate across thousands of IPs and mimic normal browsers; IP-based limits mostly work only against data center ranges (a per-network rate-limit sketch follows this list).
  • Debated approaches:
    • Server-side delays vs client-side proof-of-work: PoW (e.g., Anubis, hashcash-like JS) is stateless and cheap for servers, but burns client CPU and can be bypassed with enough hardware (see the hashcash-style sketch after this list).
    • Connection tarpits (slow uploads, long-lived sockets) are limited by server resources (a streaming tarpit sketch also follows the list).
    • Session- or fingerprint-based tracking (JA4, cookies) vs a desire to avoid maintaining state or databases.
  • Cloudflare-style protections (Turnstile, AI-block toggles, AI Labyrinth) are popular but raise centralization and “single point of failure” worries.
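
A minimal sketch of the per-network rate-limiting idea above, in Python. It assumes that even when individual IPs rotate, data-center crawlers cluster inside a few network ranges, so requests are counted per /24 (IPv4) or /48 (IPv6) prefix rather than per address; the window and budget here are illustrative, not recommendations.

      import ipaddress
      import time
      from collections import defaultdict

      WINDOW_SECONDS = 60            # sliding window length (assumption)
      MAX_HITS_PER_PREFIX = 300      # budget per prefix per window (assumption)

      _hits = defaultdict(list)      # prefix -> timestamps of recent requests

      def _prefix(ip: str) -> str:
          # Group addresses by network: /24 for IPv4, /48 for IPv6.
          addr = ipaddress.ip_address(ip)
          bits = 24 if addr.version == 4 else 48
          return str(ipaddress.ip_network(f"{ip}/{bits}", strict=False))

      def allow(ip: str) -> bool:
          """Return False once a client's whole network prefix exceeds its budget."""
          now = time.monotonic()
          key = _prefix(ip)
          recent = [t for t in _hits[key] if now - t < WINDOW_SECONDS]
          recent.append(now)
          _hits[key] = recent
          return len(recent) <= MAX_HITS_PER_PREFIX

      # A middleware would call allow(client_ip) per request and return 429 on False.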
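
A hashcash-style proof-of-work sketch, corresponding to the Anubis-like approach in the bullets above: the server hands out an HMAC-signed challenge, the client (normally in JS) brute-forces a nonce whose SHA-256 hash has enough leading zero bits, and the server can verify the result without storing any state. The difficulty, TTL, and function names are assumptions for illustration.

      import hashlib
      import hmac
      import os
      import time

      SERVER_SECRET = os.urandom(32)   # would be a persistent secret in practice
      DIFFICULTY = 20                  # required leading zero bits (assumption)
      CHALLENGE_TTL = 300              # seconds a challenge stays valid (assumption)

      def issue_challenge() -> str:
          # Stateless challenge: a timestamp plus an HMAC, so later verification
          # needs no server-side storage.
          ts = str(int(time.time()))
          sig = hmac.new(SERVER_SECRET, ts.encode(), hashlib.sha256).hexdigest()
          return f"{ts}:{sig}"

      def _leading_zero_bits(digest: bytes) -> int:
          bits = 0
          for byte in digest:
              if byte == 0:
                  bits += 8
                  continue
              bits += 8 - byte.bit_length()
              break
          return bits

      def verify(challenge: str, nonce: str) -> bool:
          """Check the challenge is ours and unexpired, and that the work was done."""
          ts, sig = challenge.split(":", 1)
          expected = hmac.new(SERVER_SECRET, ts.encode(), hashlib.sha256).hexdigest()
          if not hmac.compare_digest(sig, expected):
              return False
          if time.time() - int(ts) > CHALLENGE_TTL:
              return False
          digest = hashlib.sha256((challenge + nonce).encode()).digest()
          return _leading_zero_bits(digest) >= DIFFICULTY

      def solve(challenge: str) -> str:
          # What the client-side JS does: brute-force a nonce. At 20 bits this
          # takes on the order of a million hashes.
          counter = 0
          while True:
              nonce = str(counter)
              digest = hashlib.sha256((challenge + nonce).encode()).digest()
              if _leading_zero_bits(digest) >= DIFFICULTY:
                  return nonce
              counter += 1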
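
And a sketch of the tarpit idea: trickle a response out a few bytes at a time so a suspected bot holds a connection open for minutes. As the bullet notes, this ties up server resources too, so it only scales to a bounded number of concurrent connections; chunk size and delay here are arbitrary.

      import time
      from typing import Iterator

      CHUNK = b" " * 8                 # 8 bytes of padding per chunk (assumption)
      DELAY_SECONDS = 2.0              # pause between chunks (assumption)

      def tarpit_body(total_bytes: int = 4096) -> Iterator[bytes]:
          """Yield a response body very slowly to keep a bot's connection open."""
          sent = 0
          while sent < total_bytes:
              time.sleep(DELAY_SECONDS)
              yield CHUNK
              sent += len(CHUNK)

      # Returned as a streaming body from a WSGI/ASGI app, 4096 bytes at 8 bytes
      # every 2 seconds holds the connection open for roughly 17 minutes.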

Robots.txt, licenses, and legal/ethical angles

  • Consensus that robots.txt is only a courtesy: malicious crawlers and many AI crawlers ignore it; “canary” URLs in robots.txt are used to detect bad bots (a canary sketch follows this list).
  • Updating open-source licenses or copyright language is seen as largely toothless if big companies already ignore existing terms and treat lawsuits as a business cost.
  • Litigation for DDoS-like crawling is considered expensive and uncertain: “I made a public resource and they used it too much” may not win damages.
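
A sketch of the canary-URL technique mentioned above: robots.txt disallows a path that is linked from nowhere, so any client requesting it is, by definition, ignoring robots.txt. The path name and ban duration are made up for illustration.

      import time

      CANARY_PATH = "/do-not-crawl/"   # hypothetical path, linked from nowhere

      ROBOTS_TXT = f"""User-agent: *
      Disallow: {CANARY_PATH}
      """

      BAN_SECONDS = 24 * 3600          # how long to ban offenders (assumption)
      _banned = {}                     # ip -> time the ban started

      def record_request(ip: str, path: str) -> None:
          """Any client that fetches the canary path has ignored robots.txt."""
          if path.startswith(CANARY_PATH):
              _banned[ip] = time.time()

      def is_banned(ip: str) -> bool:
          started = _banned.get(ip)
          return started is not None and time.time() - started < BAN_SECONDS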

Poisoning and “making bots pay”

  • Several commenters propose making scraping yield negative value:
    • Serving plausible but factually wrong content to suspected bots.
    • AI-generated labyrinths or honeypots to waste their compute.
    • ZIP bombs, XML bombs, invalid data, or tiny compressed responses that expand massively client-side (a gzip-bomb sketch follows this list).
  • Others push back that deliberately adding misinformation or energy-wasting schemes is socially harmful and may cost site owners more than blocking.
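
For the “tiny compressed responses” idea, a sketch of how a gzip bomb can be generated with the standard library: a gigabyte of zeros compresses to roughly a megabyte, and a client that honors Content-Encoding: gzip inflates it on read. Sizes are illustrative, and the pushback in the last bullet (cost, collateral damage) applies.

      import gzip
      import io

      def make_gzip_bomb(uncompressed_gib: int = 1) -> bytes:
          """Return gzip data that inflates to roughly `uncompressed_gib` GiB of zeros."""
          buf = io.BytesIO()
          chunk = b"\0" * (1024 * 1024)                   # 1 MiB of zeros
          with gzip.GzipFile(fileobj=buf, mode="wb", compresslevel=9) as gz:
              for _ in range(uncompressed_gib * 1024):    # 1024 MiB per GiB
                  gz.write(chunk)
          return buf.getvalue()                           # roughly 1 MB per GiB

      # Served with a "Content-Encoding: gzip" header, an ordinary HTTP client
      # inflates the whole thing in memory when it reads the response body.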

Broader consequences for the web and search

  • Concern that aggressive, often redundant crawling (many actors re-scraping the same static pages) wastes enormous amounts of bandwidth and infrastructure capacity.
  • Widespread AI blocking could further entrench existing monopolies (Google for search; Cloudflare for protection), since new crawlers are often blocked by default.
  • Some argue that future “SEO” will be about being in LLM training data and answer engines; blocking crawlers might mean not being discoverable at all—though critics note that LLMs rarely send useful traffic back to source sites.
  • The underlying debate frames AI firms’ behavior as a symptom of capitalism’s incentives, with regulation or sanctions as the proposed counterweight; no clear resolution emerges.