End of an era for me: no more self-hosted git
Why this story resonated
- Commenters see the post as emblematic of a wider loss: small, self-hosted services becoming untenable because of automated abuse.
- Self-hosting is framed as a core early-Web right; being driven off it by scrapers feels like “end of an era” rather than just a technical nuisance.
- Even a relatively unknown personal git instance getting hammered is cited as evidence that the problem is now broad, not confined to big sites.
Nature and scale of the scraping
- Multiple people running cgit/Forgejo/Gitea/Mercurial report:
  - Tens of millions of requests in ~2 months, >99% bots.
  - Baseline CPU loads of 30–50% from crawlers alone.
  - Continuous floods with highly variable daily volume, sometimes jumping from tens of thousands to millions of requests.
- Bots exhaustively enumerate every commit, diff, blame view, and query combination, often re-fetching unchanged content.
- IPs are highly distributed (millions of addresses, including residential proxies and global data centers), making rate-limiting and IP bans ineffective.
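A quick way to check how distributed the traffic actually is, as a minimal sketch: tally hits per client IP from the access log. The sample lines below are invented, and the assumption is that the client IP is the first whitespace-separated field, as in the default nginx log format.

```shell
# Tally requests per client IP. With a genuinely distributed crawler,
# even the busiest single IPs account for only a handful of hits each,
# which is why per-IP rate limits and bans achieve so little here.
printf '%s\n' \
  '203.0.113.5 - - "GET /repo/commit/?id=abc HTTP/1.1" 200' \
  '198.51.100.7 - - "GET /repo/diff/?id=def HTTP/1.1" 200' \
  '203.0.113.5 - - "GET /repo/blame/file.c HTTP/1.1" 200' \
| awk '{count[$1]++} END {for (ip in count) print count[ip], ip}' \
| sort -rn
```

Against a real server you would feed the actual access log into the `awk` stage instead of the `printf` sample.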
Is it really “AI” traffic?
- Access logs show explicit AI-related user agents (GPTBot, ClaudeBot, Meta's and Amazon's crawlers, PetalBot, and Chinese crawlers such as YisouSpider).
- Some bots respect robots.txt if explicitly named, but often ignore wildcards; others ignore robots.txt entirely and spoof browser UAs.
- Several commenters attribute more opaque botnets to AI training/RAG or dataset sellers; a minority speculate about cloud providers or generic scraping-for-resale.
- Others argue the core pattern (sloppy, aggressive scrapers) is old; AI mainly increased demand and target value (code, blogs).
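Given the observation that some crawlers only honor rules naming them explicitly, a robots.txt along these lines is what commenters describe; the user-agent tokens are the crawler names mentioned above, and nothing guarantees any given bot will respect the file:

```
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PetalBot
Disallow: /

# Wildcard as a fallback; several crawlers reportedly ignore it.
User-agent: *
Disallow: /
```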
Proposed defenses and tradeoffs
- Hardening / restriction:
  - SSH-only git, VPN/WireGuard, HTTP basic auth, OAuth/Keycloak, Cloudflare Access; effective, but they remove or complicate public read-only access.
  - Blocking specific countries or ASNs (notably large Chinese networks, sometimes AWS), at the cost of excluding legitimate users.
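A minimal sketch of the basic-auth option for nginx; the hostname, upstream port, and htpasswd path are assumptions, and git-over-SSH is unaffected since it never touches the web server:

```nginx
# Require credentials for the whole web UI; SSH clones bypass this entirely.
server {
    listen 443 ssl;
    server_name git.example.org;                   # hypothetical hostname

    location / {
        auth_basic           "private git";
        auth_basic_user_file /etc/nginx/htpasswd;  # created with htpasswd -c
        proxy_pass           http://127.0.0.1:3000;  # e.g. a Forgejo instance
    }
}
```

This is exactly the tradeoff the thread describes: anonymous read-only browsing is gone, but so is the crawler load.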
- Protocol- and app-level changes:
  - Static site generators to avoid dynamic load.
  - Git web UIs that expose only branch heads, or nginx rules that return 404 for commit pages.
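The nginx variant might look like this for cgit-style URLs; the path patterns are assumptions and vary by web UI:

```nginx
# Serve repository and branch pages normally, but refuse the per-commit
# views that bots enumerate exhaustively (cgit-style paths assumed).
location ~ /(commit|diff|rawdiff|blame)/ {
    return 404;
}
```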
- Bot filtering:
  - Carefully tuned robots.txt naming AI bots individually; reported as effective by some.
  - Fail2ban / CrowdSec / nginx limit_req; works for concentrated abuse, but struggles against slow, massively distributed crawlers.
  - Proof-of-work challenges like Anubis, shibboleth cookies set via JavaScript reloads, and “poison” responses, suggested to frustrate or corrupt bad scrapers; these typically require JavaScript and may break no-JS users.
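For the limit_req approach, a typical nginx sketch; the zone size, rate, and burst are illustrative, and as noted above this helps little once the crawler is spread across many IPs, since each address stays under the limit:

```nginx
# One shared zone keyed by client IP: ~2 requests/second with a small burst.
limit_req_zone $binary_remote_addr zone=perip:10m rate=2r/s;

server {
    location / {
        limit_req zone=perip burst=10 nodelay;
    }
}
```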
Centralization vs self-hosting
- Cloudflare and similar services are repeatedly suggested (including pay-per-crawl), but:
  - Some report they still see large bot volumes through Cloudflare, especially via residential proxies.
  - Others worry about extreme centralization of “last mile” web traffic and the erosion of practical self-hosting.
- There is tension between using big-CDN protection and preserving the independence that motivated self-hosting in the first place.
Ethical, legal, and ecosystem concerns
- Many view indiscriminate scraping as theft of labor and bandwidth, turning the open web into an “AI mine” and “DoS-as-a-service.”
- Suggestions include charging per crawl and coordinated “data poisoning” responses, hoping to push AI companies to behave better.
- Some note regulatory and geopolitical factors: weak current law around training-data scraping, AI arms races, and lobbying delaying stronger protections.