Amazon's AI crawler is making my Git server unstable
AI/SEO Crawlers Overloading Servers
- Many report similar issues: AI and SEO bots hammering sites, often to the point of high load or near‑DDoS, especially on git forges and dynamic code viewers.
- Git/web UIs are a worst case: every commit, diff, blame view, and historical state becomes a crawlable URL, and bots naively follow “infinite” link graphs.
- Some see certain bots (e.g., Bytespider, Amazonbot, ClaudeBot, GPTBot, Meta/Facebook crawlers) dominating traffic, occasionally exceeding all human traffic by orders of magnitude.
Robots.txt Behavior and Bot Identification
- Several say AI bots “barely” respect robots.txt:
- Some only honor directives when their exact user agent is named, ignoring wildcards.
- Some ignore non-standard but commonly used directives like Crawl-delay.
- Conflicting claims about specific bots:
- Some logs show Amazonbot‑like UAs and reverse DNS; others argue user agents and rDNS results are trivially spoofed.
- An Amazon employee states the described behavior (residential IPs, changing UAs, ignoring robots.txt) is unlikely to be legitimate Amazonbot and suggests treating it as malicious traffic.
- Ambiguity remains over whether certain traffic is truly from big-company crawlers, botnets using residential proxies, or impersonators.
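One way to cut through that ambiguity is forward-confirmed reverse DNS: resolve the PTR record for the client IP, check that the hostname falls under the crawler operator's documented domain, then resolve that hostname forward and confirm it maps back to the same IP. A minimal sketch (the `crawl.amazonbot.amazon` suffix in the usage comment reflects what Amazon publishes for Amazonbot, but check their current docs; treat it as an assumption):

```python
import socket

def hostname_in_domain(hostname: str, domain: str) -> bool:
    """Pure check: is hostname the domain itself or a subdomain of it?"""
    hostname = hostname.rstrip(".").lower()
    domain = domain.rstrip(".").lower()
    return hostname == domain or hostname.endswith("." + domain)

def verify_crawler_ip(ip: str, domain: str) -> bool:
    """Forward-confirmed reverse DNS: the PTR record must sit under
    `domain`, and that hostname must resolve back to the original IP."""
    try:
        hostname, _aliases, _addrs = socket.gethostbyaddr(ip)  # reverse (PTR) lookup
        if not hostname_in_domain(hostname, domain):
            return False
        infos = socket.getaddrinfo(hostname, None)             # forward lookup
        return any(info[4][0] == ip for info in infos)
    except OSError:
        return False  # no PTR record or resolution failure: treat as unverified

# e.g. verify_crawler_ip("12.34.56.78", "crawl.amazonbot.amazon")
```

A UA string can claim anything, but an impersonator generally cannot control the PTR record for its IP and make the forward lookup agree, which is why major crawler operators document this check.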
Technical Mitigations Proposed
- Network-level:
- Tarpits (e.g., iptables TARPIT, tools like Nepenthes) to slow abusive clients.
- Rate limiting per user agent or bucket (Nginx limit_req examples), per IP, CIDR, or ASN.
- Blocking known cloud IP ranges (AWS publishes its lists), though this risks collateral damage and fails against residential proxies.
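The per-user-agent bucketing mentioned above can be sketched in Nginx: a map classifies requests by User-Agent, and limit_req throttles only the matched bucket. The bot name list, zone name, and 2r/s rate here are illustrative assumptions, not values from the thread:

```nginx
# Requests whose UA matches a known AI crawler get keyed by client IP;
# everything else gets an empty key, which Nginx does not rate-limit.
map $http_user_agent $ai_bot {
    default                                    "";
    ~*(amazonbot|bytespider|gptbot|claudebot)  $binary_remote_addr;
}

# 2 requests/second per matched bot IP; excess rejected with 429.
limit_req_zone $ai_bot zone=bots:10m rate=2r/s;

server {
    listen 80;
    location / {
        limit_req zone=bots burst=5 nodelay;
        limit_req_status 429;
        # ... normal proxy_pass/root configuration for the site ...
    }
}
```

The empty-key behavior is what makes this a per-bucket limit rather than a global one: human traffic never enters the zone at all.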
- Application-level:
- Detailed robots.txt that explicitly names AI bots, using community-maintained lists.
- Honeypot links disallowed in robots.txt: any client that fetches them gets banned.
- Captchas or “anonymous login” gates for repo viewers; proof-of-work reverse proxies / Hashcash.
- Static or pre‑obfuscated content to reduce compute load (e.g., content obfuscators, tarpit pages, zip bombs—though effectiveness is debated).
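The honeypot idea above can be sketched as a small log scan: disallow a trap path in robots.txt (e.g. `Disallow: /trap/`), link to it somewhere humans won't click, and flag any client that fetches it anyway. The trap path and the Common Log Format assumed here are illustrative:

```python
import re

# A client requesting a path that robots.txt disallows is either
# ignoring robots.txt or never read it -- a candidate for banning.
TRAP_PREFIX = "/trap/"

# Common Log Format: ip - ident [time] "METHOD /path HTTP/x.y" status size
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+) [^"]*"')

def trap_hitters(log_lines):
    """Return the set of client IPs that requested the honeypot path."""
    hits = set()
    for line in log_lines:
        m = LOG_RE.match(line)
        if m and m.group(3).startswith(TRAP_PREFIX):
            hits.add(m.group(1))
    return hits
```

The resulting set could then be fed to a firewall rule or a fail2ban-style action; the banning step itself is left out since it depends on the deployment.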
- Service-level:
- Putting sites behind Cloudflare or similar bot management/CDN layers, despite dislike of centralization.
Legal and Ethical Questions
- Debate over whether ignoring robots.txt or ToS is legally actionable:
- Some cite robots.txt as non-binding but potentially relevant evidence of “unauthorized” access.
- Others reference US CFAA guidance and UK Computer Misuse Act, suggesting a cease-and-desist plus continued access might cross a line.
- Suggestions to pursue cease-and-desist letters and potential criminal complaints versus skepticism that law enforcement will care absent large rights holders.
Broader Impact and Sentiment
- Many view aggressive AI scraping as ethically hostile: exploiting others’ bandwidth and content without consent or compensation.
- Some argue this behavior accelerates the move from open web content to closed platforms (Discord, etc.), degrading the public internet.
- A minority downplay the issue, saying admins should simply scale, cache, and throttle; others counter that small operators can’t cheaply absorb multi‑TB scraping.
- Ideas about “poisoning” AI training data surface, but some argue AI firms prioritize quantity over quality, so the only effective response is denying access entirely.