Guarding My Git Forge Against AI Scrapers

AI Data Poisoning & Information Warfare

  • Several comments explore the idea of deliberately poisoning LLM training data (e.g., esoteric languages, insecure code) to bias models or degrade their usefulness.
  • People cite claims that even relatively small poisoned datasets can measurably shift a model’s behavior, and that state actors are already “LLM grooming” models with mass-produced propaganda.
  • Others push back on specific journalism about Russian disinformation, arguing the cited article lacks evidence and over-villainizes entire nations; some counter that Russia’s behavior largely fits that description.
  • There is general agreement that nation-state information ops exist, but details and scale are contested or seen as unclear.

Scraper Behavior, Inefficiency, and Motives

  • Multiple self-hosters report scrapers hammering every blame/log view and re-crawling the same pages frequently, which suggests naïve recursive crawlers run with heavy parallelization.
  • Comments note that most bots simply follow links over HTTP rather than running git clone, and often ignore robots.txt (the check a polite crawler would run is sketched after this list); optimization is rare because the bandwidth and compute costs are externalized onto site operators.
  • Some suggest many operators are “script kiddies” or spammer-like actors chasing quantity, not quality; others speculate some abusive traffic may not even be for AI training but for generic data resale or anti-decentralization incentives.
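
As a point of contrast, the robots.txt check a well-behaved crawler performs before each fetch takes only a few lines of standard-library Python; the forge URL and user-agent string below are hypothetical, and the point is simply that the abusive crawlers described above skip even this step.

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://git.example.org/robots.txt")  # hypothetical forge
rp.read()  # fetch and parse the forge's robots.txt once

url = "https://git.example.org/alice/project/blame/main/README.md"
if rp.can_fetch("ExampleCrawler/1.0", url):
    print("polite crawler: allowed to fetch", url)
else:
    print("polite crawler: robots.txt disallows", url)
```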

Defensive Techniques

  • Config toggles: Gitea’s REQUIRE_SIGNIN_VIEW = expensive is praised for drastically cutting AI traffic and bandwidth while still allowing casual browsing; full login-only modes or OAuth2 proxies in front of heavy repos also work well (the gating idea is sketched after this list).
  • Network controls: putting forges behind WireGuard/Tailscale VPNs, blocking by IP, ASN, or country (especially for sites with a non-global audience), and requiring HTTP/2 are common patterns; people warn about false positives (e.g., travelers, Starlink users).
  • Fingerprinting: JA3/JA4 TLS fingerprints, TCP header quirks, and checks for browser-like headers help distinguish many bots from real users, though residential proxies and SIM-based botnets complicate this (a combined country/fingerprint filter is sketched below).
  • Architectural fixes: static git viewers (stagit, rgit, custom static sites) served by a simple HTTP server, or throttling at a reverse proxy, dramatically reduce load (a token-bucket throttle is sketched below).
  • “Punitive” responses: tools like Anubis (proof-of-work challenges) and Iocaine (generated garbage and link mazes) aimed at suspected bots have reportedly slashed traffic from hundreds of thousands of hits per day to a tiny fraction (a toy maze generator is sketched below).
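
A minimal sketch of the idea behind the REQUIRE_SIGNIN_VIEW = expensive toggle: anonymous visitors can still browse cheap pages, but views that force the forge to shell out to git (blame, log, archives) require a session. The path patterns here are illustrative assumptions, not Gitea’s actual route list.

```python
import re

# Illustrative "expensive" route patterns; Gitea's real list differs.
EXPENSIVE_VIEWS = re.compile(r"/(blame|commits?|graph|compare|search|archive)(/|$)")

def requires_signin(path: str, signed_in: bool) -> bool:
    """Gate the git-heavy views while leaving cheap pages open to anonymous users."""
    return not signed_in and bool(EXPENSIVE_VIEWS.search(path))

# An anonymous hit on a blame view gets bounced to login; the repo home page does not.
assert requires_signin("/alice/project/blame/main/README.md", signed_in=False)
assert not requires_signin("/alice/project", signed_in=False)
```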
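
A sketch of the country and TLS-fingerprint filtering mentioned above, assuming the MaxMind geoip2 package with a GeoLite2 database on disk and a front proxy that computes the JA3 hash and forwards it in an X-JA3 header; the header name, the blocked-country set, and the fingerprint value are all placeholders.

```python
import geoip2.database  # MaxMind's geoip2 package; needs a GeoLite2-Country.mmdb file

BLOCKED_COUNTRIES = {"XX"}                          # placeholder ISO country codes
SUSPECT_JA3 = {"00000000000000000000000000000000"}  # placeholder, not a real fingerprint

reader = geoip2.database.Reader("GeoLite2-Country.mmdb")

def should_block(client_ip: str, headers: dict[str, str]) -> bool:
    """Drop requests from out-of-audience countries or with denylisted TLS fingerprints."""
    country = reader.country(client_ip).country.iso_code
    ja3 = headers.get("X-JA3", "").lower()
    return country in BLOCKED_COUNTRIES or ja3 in SUSPECT_JA3
```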
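
For the reverse-proxy throttling approach, a classic per-client token bucket is usually enough; the rate and burst numbers below are arbitrary examples, and a real deployment would more likely do this in the proxy itself (e.g., nginx limit_req) than in application code.

```python
import time
from collections import defaultdict

RATE = 1.0    # tokens refilled per second per client (arbitrary example)
BURST = 10.0  # bucket capacity: short bursts allowed before throttling kicks in

_buckets = defaultdict(lambda: {"tokens": BURST, "last": time.monotonic()})

def allow_request(client_ip: str) -> bool:
    """Refill the client's bucket based on elapsed time, then spend one token per request."""
    bucket = _buckets[client_ip]
    now = time.monotonic()
    bucket["tokens"] = min(BURST, bucket["tokens"] + (now - bucket["last"]) * RATE)
    bucket["last"] = now
    if bucket["tokens"] >= 1.0:
        bucket["tokens"] -= 1.0
        return True
    return False  # caller responds with 429 or delays the reply
```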
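
And for the maze idea, the core trick is serving pages that are nearly free to generate but endless to crawl. This is only a toy illustration of the technique, not the actual behavior of Anubis or Iocaine; the /maze/ URL scheme and the word generator are assumptions.

```python
import random
import string

def _junk_word(rng: random.Random) -> str:
    return "".join(rng.choices(string.ascii_lowercase, k=rng.randint(4, 10)))

def maze_page(seed: int, n_links: int = 25) -> str:
    """Cheap filler page whose links all lead to more filler pages."""
    rng = random.Random(seed)  # deterministic per URL, so nothing is stored server-side
    items = "".join(
        f'<li><a href="/maze/{rng.randrange(10**9)}">{_junk_word(rng)} {_junk_word(rng)}</a></li>'
        for _ in range(n_links)
    )
    return f"<html><body><ul>{items}</ul></body></html>"

# A handler for /maze/<seed> returns maze_page(seed); a crawler that follows the links
# never reaches real content and wastes its own compute instead of the forge's.
```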

Ethics, Net Neutrality, and the “Free Web”

  • Several distinguish respectful, mutually beneficial scraping (e.g., search indexing) from abusive AI scraping that behaves like a slow DDoS and diverts users to regurgitated content without attribution or compensation.
  • Some argue that “the web should be free for humans,” but that bots abusing this norm justify technical barriers; several frame the situation as a “paradox of tolerance” moment.
  • Others worry that rising abuse is pushing hobbyist and small sites off the public internet and into VPN-only or heavily walled-garden setups, undermining the original borderless ideal.

Legal / Contract Proposals

  • Ideas like EULAs that bill “non-human readers” or force model source-code disclosure are floated, but replies broadly agree they are unenforceable: bots hide behind fake user agents and foreign IPs, and site owners have no practical mechanism for collection or jurisdiction.

Personal Impact & Sentiment

  • Many self-hosters describe depressing bot:human ratios (often 95%+ bots), fans spinning from pointless traffic, and services shut down or locked away as a result.
  • There is a sense of attrition: keeping a small public forge open increasingly means fighting large-scale scraping operations with far more resources.
  • A brief ad hominem jab at the blog author’s identity is dismissed by others as irrelevant to the technical validity of the article.