Devs say AI crawlers dominate traffic, forcing blocks on entire countries

Scale and impact of AI crawling

  • Multiple operators of small and mid-sized sites report being overwhelmed: hundreds of thousands to millions of automated requests per day vs ~1,000 legitimate ones, sometimes forcing sites to shut down or retreat behind logins.
  • Specific anecdotes describe Claude/ChatGPT-style bots hammering sites: hundreds of thousands of hits per month, triggering bandwidth caps, and ignoring HTTP 429 responses and dropped connections.
  • Some see all major AI providers as “equally terrible,” with many bots spoofing big-company user agents (often Amazon) and coming from large cloud IP ranges or residential botnets.

Crude defenses: country/IP blocks and walled gardens

  • Country-level IP blocking is described as “lazy but pragmatic”: fine if you truly expect zero real users from a country, but dangerous for general services or international businesses.
  • Blocks rooted in historic or geopolitical grievances (e.g., blocking access from Israel) raise ethical concerns about collective punishment vs targeted accountability.
  • Many sites now restrict dynamic features to logged-in users, move behind Cloudflare, or fully auth-wall content that used to be public.
  • There’s nostalgia for the “old web” and a sense that AI scraping is accelerating its replacement by login walls, private networks, and “friends-and-family” intranets.

Technical mitigation ideas

  • Rate limiting is hard when crawlers rotate across thousands of IPs and mimic normal browsers; IP-based limits mostly work only against data center ranges (a per-network rate-limit sketch follows this list).
  • Debated approaches:
    • Server-side delays vs client-side proof-of-work: PoW (e.g., Anubis, hashcash-like JS) is stateless and cheap for servers, but burns client CPU and can be bypassed with enough hardware (see the hashcash-style sketch after this list).
    • Connection tarpits (slow uploads, long-lived sockets) are limited by server resources (a streaming tarpit sketch also follows the list).
    • Session- or fingerprint-based tracking (JA4, cookies) vs a desire to avoid maintaining state or databases.
  • Cloudflare-style protections (Turnstile, AI-block toggles, AI Labyrinth) are popular but raise centralization and “single point of failure” worries.
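
A minimal sketch of the per-network rate-limiting idea above, in Python. It assumes that even when individual IPs rotate, data-center crawlers cluster inside a few network ranges, so requests are counted per /24 (IPv4) or /48 (IPv6) prefix rather than per address; the window and budget here are illustrative, not recommendations.

      import ipaddress
      import time
      from collections import defaultdict

      WINDOW_SECONDS = 60            # sliding window length (assumption)
      MAX_HITS_PER_PREFIX = 300      # budget per prefix per window (assumption)

      _hits = defaultdict(list)      # prefix -> timestamps of recent requests

      def _prefix(ip: str) -> str:
          # Group addresses by network: /24 for IPv4, /48 for IPv6.
          addr = ipaddress.ip_address(ip)
          bits = 24 if addr.version == 4 else 48
          return str(ipaddress.ip_network(f"{ip}/{bits}", strict=False))

      def allow(ip: str) -> bool:
          """Return False once a client's whole network prefix exceeds its budget."""
          now = time.monotonic()
          key = _prefix(ip)
          recent = [t for t in _hits[key] if now - t < WINDOW_SECONDS]
          recent.append(now)
          _hits[key] = recent
          return len(recent) <= MAX_HITS_PER_PREFIX

      # A middleware would call allow(client_ip) per request and return 429 on False.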
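
A hashcash-style proof-of-work sketch, corresponding to the Anubis-like approach in the bullets above: the server hands out an HMAC-signed challenge, the client (normally in JS) brute-forces a nonce whose SHA-256 hash has enough leading zero bits, and the server can verify the result without storing any state. The difficulty, TTL, and function names are assumptions for illustration.

      import hashlib
      import hmac
      import os
      import time

      SERVER_SECRET = os.urandom(32)   # would be a persistent secret in practice
      DIFFICULTY = 20                  # required leading zero bits (assumption)
      CHALLENGE_TTL = 300              # seconds a challenge stays valid (assumption)

      def issue_challenge() -> str:
          # Stateless challenge: a timestamp plus an HMAC, so later verification
          # needs no server-side storage.
          ts = str(int(time.time()))
          sig = hmac.new(SERVER_SECRET, ts.encode(), hashlib.sha256).hexdigest()
          return f"{ts}:{sig}"

      def _leading_zero_bits(digest: bytes) -> int:
          bits = 0
          for byte in digest:
              if byte == 0:
                  bits += 8
                  continue
              bits += 8 - byte.bit_length()
              break
          return bits

      def verify(challenge: str, nonce: str) -> bool:
          """Check the challenge is ours and unexpired, and that the work was done."""
          ts, sig = challenge.split(":", 1)
          expected = hmac.new(SERVER_SECRET, ts.encode(), hashlib.sha256).hexdigest()
          if not hmac.compare_digest(sig, expected):
              return False
          if time.time() - int(ts) > CHALLENGE_TTL:
              return False
          digest = hashlib.sha256((challenge + nonce).encode()).digest()
          return _leading_zero_bits(digest) >= DIFFICULTY

      def solve(challenge: str) -> str:
          # What the client-side JS does: brute-force a nonce. At 20 bits this
          # takes on the order of a million hashes.
          counter = 0
          while True:
              nonce = str(counter)
              digest = hashlib.sha256((challenge + nonce).encode()).digest()
              if _leading_zero_bits(digest) >= DIFFICULTY:
                  return nonce
              counter += 1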
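
And a sketch of the tarpit idea: trickle a response out a few bytes at a time so a suspected bot holds a connection open for minutes. As the bullet notes, this ties up server resources too, so it only scales to a bounded number of concurrent connections; chunk size and delay here are arbitrary.

      import time
      from typing import Iterator

      CHUNK = b" " * 8                 # 8 bytes of padding per chunk (assumption)
      DELAY_SECONDS = 2.0              # pause between chunks (assumption)

      def tarpit_body(total_bytes: int = 4096) -> Iterator[bytes]:
          """Yield a response body very slowly to keep a bot's connection open."""
          sent = 0
          while sent < total_bytes:
              time.sleep(DELAY_SECONDS)
              yield CHUNK
              sent += len(CHUNK)

      # Returned as a streaming body from a WSGI/ASGI app, 4096 bytes at 8 bytes
      # every 2 seconds holds the connection open for roughly 17 minutes.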

Robots.txt, licenses, and legal/ethical angles

  • Consensus that robots.txt is only a courtesy: malicious crawlers and many AI crawlers ignore it; “canary” URLs in robots.txt are used to detect bad bots (a canary sketch follows this list).
  • Updating open-source licenses or copyright language is seen as largely toothless if big companies already ignore existing terms and treat lawsuits as a business cost.
  • Litigation for DDoS-like crawling is considered expensive and uncertain: “I made a public resource and they used it too much” may not win damages.
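
A sketch of the canary-URL technique mentioned above: robots.txt disallows a path that is linked from nowhere, so any client requesting it is, by definition, ignoring robots.txt. The path name and ban duration are made up for illustration.

      import time

      CANARY_PATH = "/do-not-crawl/"   # hypothetical path, linked from nowhere

      ROBOTS_TXT = f"""User-agent: *
      Disallow: {CANARY_PATH}
      """

      BAN_SECONDS = 24 * 3600          # how long to ban offenders (assumption)
      _banned = {}                     # ip -> time the ban started

      def record_request(ip: str, path: str) -> None:
          """Any client that fetches the canary path has ignored robots.txt."""
          if path.startswith(CANARY_PATH):
              _banned[ip] = time.time()

      def is_banned(ip: str) -> bool:
          started = _banned.get(ip)
          return started is not None and time.time() - started < BAN_SECONDS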

Poisoning and “making bots pay”

  • Several commenters propose making scraping yield negative value:
    • Serving plausible but factually wrong content to suspected bots.
    • AI-generated labyrinths or honeypots to waste their compute.
    • ZIP bombs, XML bombs, invalid data, or tiny compressed responses that expand massively client-side (a gzip-bomb sketch follows this list).
  • Others push back that deliberately adding misinformation or energy-wasting schemes is socially harmful and may cost site owners more than blocking.
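
For the “tiny compressed responses” idea, a sketch of how a gzip bomb can be generated with the standard library: a gigabyte of zeros compresses to roughly a megabyte, and a client that honors Content-Encoding: gzip inflates it on read. Sizes are illustrative, and the pushback in the last bullet (cost, collateral damage) applies.

      import gzip
      import io

      def make_gzip_bomb(uncompressed_gib: int = 1) -> bytes:
          """Return gzip data that inflates to roughly `uncompressed_gib` GiB of zeros."""
          buf = io.BytesIO()
          chunk = b"\0" * (1024 * 1024)                   # 1 MiB of zeros
          with gzip.GzipFile(fileobj=buf, mode="wb", compresslevel=9) as gz:
              for _ in range(uncompressed_gib * 1024):    # 1024 MiB per GiB
                  gz.write(chunk)
          return buf.getvalue()                           # roughly 1 MB per GiB

      # Served with a "Content-Encoding: gzip" header, an ordinary HTTP client
      # inflates the whole thing in memory when it reads the response body.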

Broader consequences for the web and search

  • Concern that aggressive, often redundant crawling (many actors re-scraping the same static pages) wastes enormous amounts of bandwidth and infrastructure capacity.
  • Widespread AI blocking could further entrench existing monopolies (Google for search; Cloudflare for protection), since new crawlers are often blocked by default.
  • Some argue that future “SEO” will be about being in LLM training data and answer engines; blocking crawlers might mean not being discoverable at all—though critics note that LLMs rarely send useful traffic back to source sites.
  • The underlying debate frames AI firms’ behavior as a symptom of capitalism’s incentives, with regulation or sanctions as the proposed counterweight; no clear resolution emerges.