Guarding My Git Forge Against AI Scrapers
AI Data Poisoning & Information Warfare
- Several comments explore the idea of deliberately poisoning LLM training data (e.g., esoteric languages, insecure code) to bias models or degrade their usefulness.
- People reference claims that relatively small poisoned datasets can impact models, and that state actors are already “LLM grooming” via propaganda.
- Others push back on specific journalism about Russian disinformation, arguing the cited article lacks evidence and over-villainizes entire nations; some counter that Russia’s behavior largely fits that description.
- There is general agreement that nation-state information ops exist, but details and scale are contested or seen as unclear.
Scraper Behavior, Inefficiency, and Motives
- Multiple self-hosters report scrapers hammering every blame/log view and repeating it frequently, suggesting naïve recursive crawlers with heavy parallelization.
- Comments note most bots just follow links via HTTP rather than running `git clone`, and often ignore robots.txt (an illustrative robots.txt follows this list); optimization is rare because bandwidth and compute are externalized costs.
- Some suggest many operators are “script kiddies” or spammer-like actors chasing quantity, not quality; others speculate some abusive traffic may not even be for AI training but for generic data resale or anti-decentralization incentives.
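For reference, this is the kind of robots.txt a forge operator might publish to keep crawlers out of the expensive views; the path patterns are assumptions based on Gitea-style URLs, and, as the comments note, the scrapers in question largely ignore the file anyway:

```text
# Illustrative robots.txt for a self-hosted forge.
# Path patterns are assumptions (Gitea-style URLs); adjust to your forge.
# Well-behaved crawlers honor this; abusive scrapers generally do not.
User-agent: *
Disallow: /*/*/blame/
Disallow: /*/*/commits/
Disallow: /*/*/raw/
Crawl-delay: 10
```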
Defensive Techniques
- Config toggles: Gitea’s `REQUIRE_SIGNIN_VIEW = expensive` is praised as cutting AI traffic and bandwidth drastically while still allowing casual browsing; full login-only modes or OAuth2 proxies for heavy repos also work well (see the app.ini sketch after this list).
- Network controls: putting forges behind WireGuard/Tailscale VPNs, IP/ASN or country-level blocking (especially for non-global audiences), and HTTP/2 requirements are common patterns; people warn about false positives (e.g., travelers, Starlink users).
- Fingerprinting: JA3/JA4 TLS fingerprints, TCP header quirks, and browser-like headers help distinguish many bots from real users, though residential proxies and SIM-based botnets complicate this (a minimal JA3 hashing sketch follows this list).
- Architectural fixes: static git viewers (stagit, rgit, custom static sites) served by simple HTTP servers, or throttling via reverse proxies, dramatically reduce load (see the nginx throttling sketch after this list).
- “Punitive” responses: tools like Anubis or Iocaine that serve garbage/mazes to suspected bots have reportedly slashed traffic from hundreds of thousands of hits/day to a tiny fraction.
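Picking up the config-toggle bullet: in Gitea the setting lives under `[service]` in app.ini. A minimal sketch, assuming a Gitea release recent enough to accept the expensive value (older versions only take true/false):

```ini
; app.ini sketch, not a complete configuration
[service]
; "expensive" keeps anonymous browsing open but requires sign-in for
; costly views such as blame, raw files, and archives (exact coverage
; depends on the Gitea version); "true" would gate every page.
REQUIRE_SIGNIN_VIEW = expensive
```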
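On the fingerprinting bullet: JA3 is essentially an MD5 digest of a canonical string built from ClientHello fields. A minimal sketch in Python, assuming those fields have already been extracted by whatever terminates or sniffs TLS in front of the forge:

```python
import hashlib
from typing import Sequence


def ja3_hash(tls_version: int,
             ciphers: Sequence[int],
             extensions: Sequence[int],
             curves: Sequence[int],
             point_formats: Sequence[int]) -> str:
    """Join the ClientHello fields into the canonical JA3 string
    (decimal values, '-' within a field, ',' between fields) and
    return its MD5 hex digest. GREASE values should be filtered
    out of the inputs beforehand."""
    fields = [
        str(tls_version),
        "-".join(str(c) for c in ciphers),
        "-".join(str(e) for e in extensions),
        "-".join(str(c) for c in curves),
        "-".join(str(p) for p in point_formats),
    ]
    return hashlib.md5(",".join(fields).encode("ascii")).hexdigest()


# Placeholder values; real ones come from an actual ClientHello.
print(ja3_hash(771, [4865, 4866, 49195], [0, 11, 10], [29, 23, 24], [0]))
```

Comparing the resulting hashes against known browser fingerprints (or simply rate-limiting unknown ones) is the usual way this feeds into blocking; residential proxies hide the IP but not the client's TLS stack.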
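And for the reverse-proxy throttling in the architectural-fixes bullet, a minimal nginx sketch; the zone name, rate, hostname, and upstream port are placeholders, and TLS directives are omitted:

```nginx
# Cap each client IP at ~2 requests/second with a small burst: generous
# for a human clicking around a forge, starvation for a parallel crawler.
limit_req_zone $binary_remote_addr zone=forge_ratelimit:10m rate=2r/s;

server {
    listen 80;
    server_name git.example.org;           # placeholder hostname

    location / {
        limit_req zone=forge_ratelimit burst=10 nodelay;
        limit_req_status 429;
        proxy_pass http://127.0.0.1:3000;  # e.g. a local Gitea instance
    }
}
```

Per-IP limits like this do little against a distributed botnet, which is why commenters pair them with the fingerprinting and sign-in approaches above.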
Ethics, Net Neutrality, and the “Free Web”
- Several distinguish respectful, mutually beneficial scraping (e.g., search indexing) from abusive AI scraping that behaves like a slow DDoS and diverts users to regurgitated content without attribution or compensation.
- Some argue “the web should be free for humans,” but bots abusing that norm justify technical barriers—framed as a “paradox of tolerance” moment.
- Others worry that rising abuse is pushing hobbyist and small sites off the public internet and into VPN-only or heavily walled-garden setups, undermining the original borderless ideal.
Legal / Contract Proposals
- Ideas like EULAs billing “non-human readers” or forcing model source-code disclosure are floated, but replies broadly agree these are unenforceable: bots hide behind fake user agents and foreign IPs, and there is no practical mechanism for collection or jurisdiction.
Personal Impact & Sentiment
- Many self-hosters describe depressing bot:human ratios (often 95%+ bots), fans spinning from pointless traffic, and services shut down or locked away as a result.
- There is a sense of attrition: keeping a small public forge open increasingly means fighting large-scale scraping operations with far more resources.
- A brief ad hominem jab at the blog author’s identity is countered by others as irrelevant to the technical validity of the article.