Amazon's AI crawler is making my Git server unstable

AI/SEO Crawlers Overloading Servers

  • Many report similar issues: AI and SEO bots hammering sites, often to the point of high load or near‑DDoS, especially on git forges and dynamic code viewers.
  • Git/web UIs are a worst case: every commit, diff, blame view, and historical state becomes a crawlable URL, and bots naively follow “infinite” link graphs.
  • Some see specific bots (e.g., Bytespider, Amazonbot, ClaudeBot, GPTBot, Meta/Facebook crawlers) dominating traffic, occasionally exceeding all human traffic by orders of magnitude.

Robots.txt Behavior and Bot Identification

  • Several say AI bots “barely” respect robots.txt:
    • Some only honor directives when their exact user agent is named, ignoring wildcard groups (see the robots.txt sketch after this list).
    • Some ignore non-standard but commonly used directives like Crawl-delay.
  • Conflicting claims about specific bots:
    • Some logs show Amazonbot-like user agents with matching reverse DNS; others counter that UAs and bare rDNS are trivially spoofed, whereas a forward-confirmed rDNS check (sketched below) is much harder to fake.
    • An Amazon employee states the described behavior (residential IPs, changing UAs, ignoring robots.txt) is unlikely to be legitimate Amazonbot and suggests treating it as malicious traffic.
  • Ambiguity remains over whether certain traffic is truly from big-company crawlers, botnets using residential proxies, or impersonators.
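
Because some crawlers only honor groups that name their exact user-agent token and skip the * group, an explicit robots.txt along these lines is often suggested (the bot tokens are real; the disallowed paths are illustrative):

    # Name each AI crawler explicitly; wildcard-only rules may be ignored.
    User-agent: Amazonbot
    User-agent: Bytespider
    User-agent: GPTBot
    User-agent: ClaudeBot
    Disallow: /

    # Fallback group; Crawl-delay is non-standard and widely ignored.
    User-agent: *
    Crawl-delay: 10
    Disallow: /honeypot/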
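
On the identification question: a plain PTR lookup can be faked by whoever controls the IP's reverse zone, but forward-confirmed reverse DNS (resolve the PTR name back and require the original IP among its addresses) is much harder to spoof. A minimal Python sketch; the IP is a placeholder, and the suffix you compare the verified hostname against should come from the crawler operator's own documentation:

    import socket

    def fcrdns_hostname(ip: str):
        """Forward-confirmed reverse DNS: look up the PTR name, then
        confirm that name resolves back to the same IP. Returns the
        verified hostname, or None if the check fails."""
        try:
            host, _, _ = socket.gethostbyaddr(ip)
            addrs = {info[4][0] for info in socket.getaddrinfo(host, None)}
        except OSError:
            return None
        return host if ip in addrs else None

    # Usage: compare the verified name against the operator's documented
    # crawler domain before whitelisting or blaming anyone.
    print(fcrdns_hostname("203.0.113.7"))  # placeholder IP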

Technical Mitigations Proposed

  • Network-level (sketches follow this list):
    • Tarpits (e.g., iptables TARPIT, tools like Nepenthes) to slow abusive clients.
    • Rate limiting per user-agent bucket, IP, CIDR, or ASN (e.g., Nginx limit_req).
    • Blocking known cloud IP ranges (AWS publishes its address space as JSON), though this risks collateral damage and fails against residential proxies.
  • Application-level (sketches follow this list):
    • A detailed robots.txt that explicitly names AI bots, drawing on community-maintained bot lists (see the robots.txt sketch above).
    • Honeypot links disallowed in robots.txt: any client that fetches them gets banned.
    • Captchas or “anonymous login” gates in front of repo viewers; proof-of-work reverse proxies using Hashcash-style challenges.
    • Static or pre-obfuscated content to reduce compute load, plus more aggressive countermeasures (content obfuscators, tarpit pages, zip bombs) whose effectiveness is debated.
  • Service-level:
    • Putting sites behind Cloudflare or similar bot management/CDN layers, despite dislike of centralization.
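
A TARPIT rule (via the xtables-addons package) accepts the TCP handshake and then pins the connection open with a zero window, so the bot wastes a connection slot at near-zero cost to the server. A one-line sketch; the source range is a placeholder for whatever ban list you maintain:

    # Hold abusive clients' HTTPS connections open instead of rejecting them.
    iptables -A INPUT -p tcp --dport 443 -s 203.0.113.0/24 -j TARPIT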
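
For the limit_req approach, one common pattern is to key the zone on user-agent class, so self-identified crawlers share a throttled bucket while requests with an empty key are not limited. A sketch; the UA regex, rates, and backend are illustrative:

    # Inside the http {} block.
    map $http_user_agent $bot_bucket {
        default                            "";
        ~*(bot|crawler|spider|bytespider)  $binary_remote_addr;
    }
    limit_req_zone $bot_bucket zone=bots:10m rate=1r/s;

    server {
        listen 80;
        location / {
            limit_req zone=bots burst=5 nodelay;
            proxy_pass http://127.0.0.1:3000;  # hypothetical forge backend
        }
    }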
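
Cloud-range blocking is straightforward in AWS's case because the address space is published as JSON; a short Python sketch that extracts EC2 prefixes for a firewall deny list (subject to the collateral-damage caveat above):

    import json
    import urllib.request

    # AWS's published, regularly updated list of its IP ranges.
    URL = "https://ip-ranges.amazonaws.com/ip-ranges.json"

    with urllib.request.urlopen(URL) as resp:
        data = json.load(resp)

    # Print EC2 CIDRs; feed these into ipset/nftables or an Nginx deny list.
    for entry in data["prefixes"]:
        if entry["service"] == "EC2":
            print(entry["ip_prefix"])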
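
The honeypot idea pairs a path that robots.txt disallows (like /honeypot/ in the sketch earlier) with an automatic ban for any client that fetches it, since only a robots.txt-ignoring crawler should ever land there. A self-contained Python sketch with an in-memory ban list; the path, port, and lack of persistence are all hypothetical choices:

    from http.server import BaseHTTPRequestHandler, HTTPServer

    TRAP_PREFIX = "/honeypot/"   # also listed under Disallow: in robots.txt
    banned = set()               # in-memory; persist to disk in practice

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            ip = self.client_address[0]
            if ip in banned:
                self.send_error(403)
                return
            if self.path.startswith(TRAP_PREFIX):
                banned.add(ip)   # compliant clients never reach this path
                self.send_error(403)
                return
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(b"ok\n")

    if __name__ == "__main__":
        HTTPServer(("", 8080), Handler).serve_forever()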
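
Hashcash-style proof-of-work gating makes each request cost the client CPU: the server hands out a random challenge and only serves content once the client finds a nonce whose hash starts with N zero bits, so the cost scales with request volume while a human's single page view stays cheap. A minimal verifier/solver sketch with illustrative difficulty values:

    import hashlib
    import itertools
    import os

    def valid(challenge: bytes, nonce: bytes, bits: int) -> bool:
        # Accept the nonce iff sha256(challenge + nonce) has `bits`
        # leading zero bits.
        digest = hashlib.sha256(challenge + nonce).digest()
        return int.from_bytes(digest, "big") >> (256 - bits) == 0

    def solve(challenge: bytes, bits: int) -> bytes:
        # Brute force: ~2**bits hash evaluations on average.
        for n in itertools.count():
            nonce = str(n).encode()
            if valid(challenge, nonce, bits):
                return nonce

    if __name__ == "__main__":
        challenge = os.urandom(16)
        nonce = solve(challenge, bits=16)        # low difficulty for a quick demo
        print(valid(challenge, nonce, bits=16))  # True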

Legal and Ethical Questions

  • Debate over whether ignoring robots.txt or ToS is legally actionable:
    • Some cite robots.txt as non-binding but potentially relevant evidence of “unauthorized” access.
    • Others reference US CFAA guidance and the UK Computer Misuse Act, suggesting that continued access after a cease-and-desist might cross a legal line.
  • Some suggest pursuing cease-and-desist letters and potential criminal complaints; others are skeptical that law enforcement will care absent large rights holders.

Broader Impact and Sentiment

  • Many view aggressive AI scraping as ethically hostile: exploiting others’ bandwidth and content without consent or compensation.
  • Some argue this behavior accelerates the move from open web content to closed platforms (Discord, etc.), degrading the public internet.
  • A minority downplay the issue, saying admins should simply scale, cache, and throttle; others counter that small operators can’t cheaply absorb multi‑TB scraping.
  • Ideas about “poisoning” AI training data surface, but some argue AI firms prioritize quantity over quality, so the only effective response is denying access entirely.