End of an era for me: no more self-hosted git
Why this story resonated
- Commenters see the post as emblematic of a wider loss: small, self-hosted services becoming untenable because of automated abuse.
- Self-hosting is framed as a core early-Web right; being driven off it by scrapers feels like “end of an era” rather than just a technical nuisance.
- Even a relatively unknown personal git instance getting hammered is cited as evidence that the problem is now broad, not confined to big sites.
Nature and scale of the scraping
- Multiple people running cgit/Forgejo/Gitea/Mercurial report:
  - Tens of millions of requests in ~2 months, >99% bots.
  - Baseline CPU loads of 30–50% from crawlers alone.
  - Continuous floods with highly variable daily volume, sometimes jumping from tens of thousands to millions of requests.
- Bots exhaustively enumerate every commit, diff, blame view, and query combination, often re-fetching unchanged content.
- IPs are highly distributed (millions of addresses, including residential proxies and global data centers), making rate-limiting and IP bans ineffective.
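A quick way to check how distributed the traffic actually is, as a minimal sketch: tally hits per client IP from the access log. The sample lines below are invented, and the assumption is that the client IP is the first whitespace-separated field, as in the default nginx log format.

```shell
# Tally requests per client IP. With a genuinely distributed crawler,
# even the busiest single IPs account for only a handful of hits each,
# which is why per-IP rate limits and bans achieve so little here.
printf '%s\n' \
  '203.0.113.5 - - "GET /repo/commit/?id=abc HTTP/1.1" 200' \
  '198.51.100.7 - - "GET /repo/diff/?id=def HTTP/1.1" 200' \
  '203.0.113.5 - - "GET /repo/blame/file.c HTTP/1.1" 200' \
| awk '{count[$1]++} END {for (ip in count) print count[ip], ip}' \
| sort -rn
```

Against a real server you would feed the actual access log into the `awk` stage instead of the `printf` sample.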
Is it really “AI” traffic?
- Access logs show explicit AI-related user agents (GPTBot, ClaudeBot, Meta's and Amazon's crawlers, PetalBot, and Chinese crawlers such as YisouSpider).
- Some bots respect robots.txt if explicitly named, but often ignore wildcards; others ignore robots.txt entirely and spoof browser UAs.
- Several commenters attribute more opaque botnets to AI training/RAG or dataset sellers; a minority speculate about cloud providers or generic scraping-for-resale.
- Others argue the core pattern (sloppy, aggressive scrapers) is old; AI mainly increased demand and target value (code, blogs).
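Given the observation that some crawlers only honor rules naming them explicitly, a robots.txt along these lines is what commenters describe; the user-agent tokens are the crawler names mentioned above, and nothing guarantees any given bot will respect the file:

```
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PetalBot
Disallow: /

# Wildcard as a fallback; several crawlers reportedly ignore it.
User-agent: *
Disallow: /
```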
Proposed defenses and tradeoffs
- Hardening / restriction:
  - SSH-only git, VPN/WireGuard, HTTP basic auth, OAuth/Keycloak, Cloudflare Access; effective, but they remove or complicate public read-only access.
  - Blocking specific countries or ASNs (notably large Chinese networks, sometimes AWS), at the cost of excluding legitimate users.
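A minimal sketch of the basic-auth option for nginx; the hostname, upstream port, and htpasswd path are assumptions, and git-over-SSH is unaffected since it never touches the web server:

```nginx
# Require credentials for the whole web UI; SSH clones bypass this entirely.
server {
    listen 443 ssl;
    server_name git.example.org;                   # hypothetical hostname

    location / {
        auth_basic           "private git";
        auth_basic_user_file /etc/nginx/htpasswd;  # created with htpasswd -c
        proxy_pass           http://127.0.0.1:3000;  # e.g. a Forgejo instance
    }
}
```

This is exactly the tradeoff the thread describes: anonymous read-only browsing is gone, but so is the crawler load.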
- Protocol- and app-level changes:
  - Static site generators to avoid dynamic load.
  - Git web UIs that expose only branch heads, or nginx rules that return 404 for commit pages.
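The nginx variant might look like this for cgit-style URLs; the path patterns are assumptions and vary by web UI:

```nginx
# Serve repository and branch pages normally, but refuse the per-commit
# views that bots enumerate exhaustively (cgit-style paths assumed).
location ~ /(commit|diff|rawdiff|blame)/ {
    return 404;
}
```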
- Bot filtering:
  - Carefully tuned robots.txt naming AI bots individually; reported as effective by some.
  - Fail2ban / CrowdSec / nginx limit_req; works for concentrated abuse, but struggles against slow, massively distributed crawlers.
  - Proof-of-work challenges like Anubis, shibboleth cookies set via JavaScript reloads, and “poison” responses, suggested to frustrate or corrupt bad scrapers; these typically require JavaScript and may break no-JS users.
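For the limit_req approach, a typical nginx sketch; the zone size, rate, and burst are illustrative, and as noted above this helps little once the crawler is spread across many IPs, since each address stays under the limit:

```nginx
# One shared zone keyed by client IP: ~2 requests/second with a small burst.
limit_req_zone $binary_remote_addr zone=perip:10m rate=2r/s;

server {
    location / {
        limit_req zone=perip burst=10 nodelay;
    }
}
```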
Centralization vs self-hosting
- Cloudflare and similar services are repeatedly suggested (including pay-per-crawl), but:
  - Some report they still see large bot volumes through Cloudflare, especially via residential proxies.
  - Others worry about extreme centralization of “last mile” web traffic and the erosion of practical self-hosting.
- There is tension between using big-CDN protection and preserving the independence that motivated self-hosting in the first place.
Ethical, legal, and ecosystem concerns
- Many view indiscriminate scraping as theft of labor and bandwidth, turning the open web into an “AI mine” and “DoS-as-a-service.”
- Suggestions include charging per crawl and coordinated “data poisoning” responses, hoping to push AI companies to behave better.
- Some note regulatory and geopolitical factors: weak current law around training-data scraping, AI arms races, and lobbying delaying stronger protections.