Amazon's AI crawler is making my Git server unstable

AI/SEO Crawlers Overloading Servers

  • Many report similar issues: AI and SEO bots hammering sites, often to the point of high load or near‑DDoS, especially on git forges and dynamic code viewers.
  • Git/web UIs are a worst case: every commit, diff, blame view, and historical state becomes a crawlable URL, and bots naively follow “infinite” link graphs.
  • Some see specific bots (e.g., Bytespider, Amazonbot, ClaudeBot, GPTBot, Meta/Facebook crawlers) dominating traffic, occasionally exceeding all human traffic by orders of magnitude.

Robots.txt Behavior and Bot Identification

  • Several say AI bots “barely” respect robots.txt:
    • Some only honor directives when their exact user agent is named, ignoring wildcard groups (see the robots.txt sketch after this list).
    • Some ignore non-standard but commonly used directives like Crawl-delay.
  • Conflicting claims about specific bots:
    • Some logs show Amazonbot-like user agents with matching reverse DNS; others counter that UAs and bare rDNS are trivially spoofed, whereas a forward-confirmed rDNS check (sketched below) is much harder to fake.
    • An Amazon employee states the described behavior (residential IPs, changing UAs, ignoring robots.txt) is unlikely to be legitimate Amazonbot and suggests treating it as malicious traffic.
  • Ambiguity remains over whether certain traffic is truly from big-company crawlers, botnets using residential proxies, or impersonators.
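
Because some crawlers only honor groups that name their exact user-agent token and skip the * group, an explicit robots.txt along these lines is often suggested (the bot tokens are real; the disallowed paths are illustrative):

    # Name each AI crawler explicitly; wildcard-only rules may be ignored.
    User-agent: Amazonbot
    User-agent: Bytespider
    User-agent: GPTBot
    User-agent: ClaudeBot
    Disallow: /

    # Fallback group; Crawl-delay is non-standard and widely ignored.
    User-agent: *
    Crawl-delay: 10
    Disallow: /honeypot/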
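
On the identification question: a plain PTR lookup can be faked by whoever controls the IP's reverse zone, but forward-confirmed reverse DNS (resolve the PTR name back and require the original IP among its addresses) is much harder to spoof. A minimal Python sketch; the IP is a placeholder, and the suffix you compare the verified hostname against should come from the crawler operator's own documentation:

    import socket

    def fcrdns_hostname(ip: str):
        """Forward-confirmed reverse DNS: look up the PTR name, then
        confirm that name resolves back to the same IP. Returns the
        verified hostname, or None if the check fails."""
        try:
            host, _, _ = socket.gethostbyaddr(ip)
            addrs = {info[4][0] for info in socket.getaddrinfo(host, None)}
        except OSError:
            return None
        return host if ip in addrs else None

    # Usage: compare the verified name against the operator's documented
    # crawler domain before whitelisting or blaming anyone.
    print(fcrdns_hostname("203.0.113.7"))  # placeholder IP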

Technical Mitigations Proposed

  • Network-level (sketches follow this list):
    • Tarpits (e.g., iptables TARPIT, tools like Nepenthes) to slow abusive clients.
    • Rate limiting per user-agent bucket, IP, CIDR, or ASN (e.g., Nginx limit_req).
    • Blocking known cloud IP ranges (AWS publishes its address space as JSON), though this risks collateral damage and fails against residential proxies.
  • Application-level (sketches follow this list):
    • A detailed robots.txt that explicitly names AI bots, drawing on community-maintained bot lists (see the robots.txt sketch above).
    • Honeypot links disallowed in robots.txt: any client that fetches them gets banned.
    • Captchas or “anonymous login” gates in front of repo viewers; proof-of-work reverse proxies using Hashcash-style challenges.
    • Static or pre-obfuscated content to reduce compute load, plus more aggressive countermeasures (content obfuscators, tarpit pages, zip bombs) whose effectiveness is debated.
  • Service-level:
    • Putting sites behind Cloudflare or similar bot management/CDN layers, despite dislike of centralization.
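
A TARPIT rule (via the xtables-addons package) accepts the TCP handshake and then pins the connection open with a zero window, so the bot wastes a connection slot at near-zero cost to the server. A one-line sketch; the source range is a placeholder for whatever ban list you maintain:

    # Hold abusive clients' HTTPS connections open instead of rejecting them.
    iptables -A INPUT -p tcp --dport 443 -s 203.0.113.0/24 -j TARPIT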
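
For the limit_req approach, one common pattern is to key the zone on user-agent class, so self-identified crawlers share a throttled bucket while requests with an empty key are not limited. A sketch; the UA regex, rates, and backend are illustrative:

    # Inside the http {} block.
    map $http_user_agent $bot_bucket {
        default                            "";
        ~*(bot|crawler|spider|bytespider)  $binary_remote_addr;
    }
    limit_req_zone $bot_bucket zone=bots:10m rate=1r/s;

    server {
        listen 80;
        location / {
            limit_req zone=bots burst=5 nodelay;
            proxy_pass http://127.0.0.1:3000;  # hypothetical forge backend
        }
    }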
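
Cloud-range blocking is straightforward in AWS's case because the address space is published as JSON; a short Python sketch that extracts EC2 prefixes for a firewall deny list (subject to the collateral-damage caveat above):

    import json
    import urllib.request

    # AWS's published, regularly updated list of its IP ranges.
    URL = "https://ip-ranges.amazonaws.com/ip-ranges.json"

    with urllib.request.urlopen(URL) as resp:
        data = json.load(resp)

    # Print EC2 CIDRs; feed these into ipset/nftables or an Nginx deny list.
    for entry in data["prefixes"]:
        if entry["service"] == "EC2":
            print(entry["ip_prefix"])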
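
The honeypot idea pairs a path that robots.txt disallows (like /honeypot/ in the sketch earlier) with an automatic ban for any client that fetches it, since only a robots.txt-ignoring crawler should ever land there. A self-contained Python sketch with an in-memory ban list; the path, port, and lack of persistence are all hypothetical choices:

    from http.server import BaseHTTPRequestHandler, HTTPServer

    TRAP_PREFIX = "/honeypot/"   # also listed under Disallow: in robots.txt
    banned = set()               # in-memory; persist to disk in practice

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            ip = self.client_address[0]
            if ip in banned:
                self.send_error(403)
                return
            if self.path.startswith(TRAP_PREFIX):
                banned.add(ip)   # compliant clients never reach this path
                self.send_error(403)
                return
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(b"ok\n")

    if __name__ == "__main__":
        HTTPServer(("", 8080), Handler).serve_forever()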
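
Hashcash-style proof-of-work gating makes each request cost the client CPU: the server hands out a random challenge and only serves content once the client finds a nonce whose hash starts with N zero bits, so the cost scales with request volume while a human's single page view stays cheap. A minimal verifier/solver sketch with illustrative difficulty values:

    import hashlib
    import itertools
    import os

    def valid(challenge: bytes, nonce: bytes, bits: int) -> bool:
        # Accept the nonce iff sha256(challenge + nonce) has `bits`
        # leading zero bits.
        digest = hashlib.sha256(challenge + nonce).digest()
        return int.from_bytes(digest, "big") >> (256 - bits) == 0

    def solve(challenge: bytes, bits: int) -> bytes:
        # Brute force: ~2**bits hash evaluations on average.
        for n in itertools.count():
            nonce = str(n).encode()
            if valid(challenge, nonce, bits):
                return nonce

    if __name__ == "__main__":
        challenge = os.urandom(16)
        nonce = solve(challenge, bits=16)        # low difficulty for a quick demo
        print(valid(challenge, nonce, bits=16))  # True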

Legal and Ethical Questions

  • Debate over whether ignoring robots.txt or ToS is legally actionable:
    • Some cite robots.txt as non-binding but potentially relevant evidence of “unauthorized” access.
    • Others reference US CFAA guidance and the UK Computer Misuse Act, suggesting that continued access after a cease-and-desist might cross a legal line.
  • Some suggest pursuing cease-and-desist letters and potential criminal complaints; others are skeptical that law enforcement will care absent large rights holders.

Broader Impact and Sentiment

  • Many view aggressive AI scraping as ethically hostile: exploiting others’ bandwidth and content without consent or compensation.
  • Some argue this behavior accelerates the move from open web content to closed platforms (Discord, etc.), degrading the public internet.
  • A minority downplay the issue, saying admins should simply scale, cache, and throttle; others counter that small operators can’t cheaply absorb multi‑TB scraping.
  • Ideas about “poisoning” AI training data surface, but some argue AI firms prioritize quantity over quality, so the only effective response is denying access entirely.