Guarding My Git Forge Against AI Scrapers
AI Data Poisoning & Information Warfare
- Several comments explore the idea of deliberately poisoning LLM training data (e.g., esoteric languages, insecure code) to bias models or degrade their usefulness.
- People reference claims that relatively small poisoned datasets can impact models, and that state actors are already “LLM grooming” via propaganda.
- Others push back on specific journalism about Russian disinformation, arguing the cited article lacks evidence and over-villainizes entire nations; some counter that Russia’s behavior largely fits that description.
- There is general agreement that nation-state information ops exist, but details and scale are contested or seen as unclear.
Scraper Behavior, Inefficiency, and Motives
- Multiple self-hosters report scrapers hammering every blame/log view and repeating it frequently, suggesting naïve recursive crawlers with heavy parallelization.
- Comments note most bots just follow links via HTTP rather than running `git clone`, and often ignore robots.txt (an illustrative robots.txt follows this list); optimization is rare because bandwidth and compute are externalized costs.
- Some suggest many operators are “script kiddies” or spammer-like actors chasing quantity, not quality; others speculate some abusive traffic may not even be for AI training but for generic data resale or anti-decentralization incentives.
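For reference, this is the kind of robots.txt a forge operator might publish to keep crawlers out of the expensive views; the path patterns are assumptions based on Gitea-style URLs, and, as the comments note, the scrapers in question largely ignore the file anyway:

```text
# Illustrative robots.txt for a self-hosted forge.
# Path patterns are assumptions (Gitea-style URLs); adjust to your forge.
# Well-behaved crawlers honor this; abusive scrapers generally do not.
User-agent: *
Disallow: /*/*/blame/
Disallow: /*/*/commits/
Disallow: /*/*/raw/
Crawl-delay: 10
```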
Defensive Techniques
- Config toggles: Gitea’s `REQUIRE_SIGNIN_VIEW = expensive` is praised as cutting AI traffic and bandwidth drastically while still allowing casual browsing; full login-only modes or OAuth2 proxies for heavy repos also work well (see the app.ini sketch after this list).
- Network controls: putting forges behind WireGuard/Tailscale VPNs, IP/ASN or country-level blocking (especially for non-global audiences), and HTTP/2 requirements are common patterns; people warn about false positives (e.g., travelers, Starlink users).
- Fingerprinting: JA3/JA4 TLS fingerprints, TCP header quirks, and browser-like headers help distinguish many bots from real users, though residential proxies and SIM-based botnets complicate this (a minimal JA3 hashing sketch follows this list).
- Architectural fixes: static git viewers (stagit, rgit, custom static sites) served by simple HTTP servers, or throttling via reverse proxies, dramatically reduce load (see the nginx throttling sketch after this list).
- “Punitive” responses: tools like Anubis or Iocaine that serve garbage/mazes to suspected bots have reportedly slashed traffic from hundreds of thousands of hits/day to a tiny fraction.
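Picking up the config-toggle bullet: in Gitea the setting lives under `[service]` in app.ini. A minimal sketch, assuming a Gitea release recent enough to accept the expensive value (older versions only take true/false):

```ini
; app.ini sketch, not a complete configuration
[service]
; "expensive" keeps anonymous browsing open but requires sign-in for
; costly views such as blame, raw files, and archives (exact coverage
; depends on the Gitea version); "true" would gate every page.
REQUIRE_SIGNIN_VIEW = expensive
```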
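On the fingerprinting bullet: JA3 is essentially an MD5 digest of a canonical string built from ClientHello fields. A minimal sketch in Python, assuming those fields have already been extracted by whatever terminates or sniffs TLS in front of the forge:

```python
import hashlib
from typing import Sequence


def ja3_hash(tls_version: int,
             ciphers: Sequence[int],
             extensions: Sequence[int],
             curves: Sequence[int],
             point_formats: Sequence[int]) -> str:
    """Join the ClientHello fields into the canonical JA3 string
    (decimal values, '-' within a field, ',' between fields) and
    return its MD5 hex digest. GREASE values should be filtered
    out of the inputs beforehand."""
    fields = [
        str(tls_version),
        "-".join(str(c) for c in ciphers),
        "-".join(str(e) for e in extensions),
        "-".join(str(c) for c in curves),
        "-".join(str(p) for p in point_formats),
    ]
    return hashlib.md5(",".join(fields).encode("ascii")).hexdigest()


# Placeholder values; real ones come from an actual ClientHello.
print(ja3_hash(771, [4865, 4866, 49195], [0, 11, 10], [29, 23, 24], [0]))
```

Comparing the resulting hashes against known browser fingerprints (or simply rate-limiting unknown ones) is the usual way this feeds into blocking; residential proxies hide the IP but not the client's TLS stack.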
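And for the reverse-proxy throttling in the architectural-fixes bullet, a minimal nginx sketch; the zone name, rate, hostname, and upstream port are placeholders, and TLS directives are omitted:

```nginx
# Cap each client IP at ~2 requests/second with a small burst: generous
# for a human clicking around a forge, starvation for a parallel crawler.
limit_req_zone $binary_remote_addr zone=forge_ratelimit:10m rate=2r/s;

server {
    listen 80;
    server_name git.example.org;           # placeholder hostname

    location / {
        limit_req zone=forge_ratelimit burst=10 nodelay;
        limit_req_status 429;
        proxy_pass http://127.0.0.1:3000;  # e.g. a local Gitea instance
    }
}
```

Per-IP limits like this do little against a distributed botnet, which is why commenters pair them with the fingerprinting and sign-in approaches above.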
Ethics, Net Neutrality, and the “Free Web”
- Several distinguish respectful, mutually beneficial scraping (e.g., search indexing) from abusive AI scraping that behaves like a slow DDoS and diverts users to regurgitated content without attribution or compensation.
- Some argue “the web should be free for humans,” but bots abusing that norm justify technical barriers—framed as a “paradox of tolerance” moment.
- Others worry that rising abuse is pushing hobbyist and small sites off the public internet and into VPN-only or heavily walled-garden setups, undermining the original borderless ideal.
Legal / Contract Proposals
- Ideas like EULAs billing “non-human readers” or forcing model source-code disclosure are floated, but replies broadly agree these are unenforceable: bots hide behind fake user agents and foreign IPs, and there is no practical mechanism for collection or jurisdiction.
Personal Impact & Sentiment
- Many self-hosters describe depressing bot:human ratios (often 95%+ bots), fans spinning from pointless traffic, and services shut down or locked away as a result.
- There is a sense of attrition: keeping a small public forge open increasingly means fighting large-scale scraping operations with far more resources.
- A brief ad hominem jab at the blog author’s identity is countered by others as irrelevant to the technical validity of the article.