End of an era for me: no more self-hosted git

Why this story resonated

  • Commenters see the post as emblematic of a wider loss: small, self-hosted services becoming untenable because of automated abuse.
  • Self-hosting is framed as a core early-Web right; being driven off it by scrapers feels like “end of an era” rather than just a technical nuisance.
  • Even a relatively unknown personal git instance getting hammered is cited as evidence the problem is now broad, not only for big sites.

Nature and scale of the scraping

  • Multiple people running cgit/Forgejo/Gitea/Mercurial report:
    • Tens of millions of requests in ~2 months, >99% bots.
    • Baseline CPU loads of 30–50% from crawlers alone.
    • Continuous floods with highly variable daily volume, sometimes jumping from tens of thousands to millions of requests.
  • Bots exhaustively enumerate every commit, diff, blame view, and query combination, often re-fetching unchanged content.
  • IPs are highly distributed (millions of addresses, including residential proxies and global data centers), making rate-limiting and IP bans ineffective.
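Figures like ">99% bots" typically come from tallying user agents in access logs. A minimal sketch of that tally, assuming the common nginx/Apache "combined" log format; the UA substring list is illustrative, not exhaustive:

```python
import re
from collections import Counter

# Illustrative UA substrings (bot names drawn from the thread; list is an assumption)
BOT_UA_HINTS = ("GPTBot", "ClaudeBot", "PetalBot", "YisouSpider",
                "Amazonbot", "bot", "spider", "crawl")

# Combined log format: ip ident user [time] "request" status size "referer" "user-agent"
LOG_RE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d+ \S+ "[^"]*" "([^"]*)"')

def classify(line):
    """Return (ip, is_bot) for a combined-format log line, or None if unparseable."""
    m = LOG_RE.match(line)
    if not m:
        return None
    ip, ua = m.groups()
    is_bot = any(hint.lower() in ua.lower() for hint in BOT_UA_HINTS)
    return ip, is_bot

def bot_share(lines):
    """Fraction of parseable requests whose UA matches a bot hint."""
    counts = Counter()
    for line in lines:
        parsed = classify(line)
        if parsed:
            counts["bot" if parsed[1] else "human"] += 1
    total = sum(counts.values()) or 1
    return counts["bot"] / total
```

Note this undercounts: bots that spoof browser UAs (discussed below) only show up via behavioral signals such as exhaustive URL enumeration, not UA matching.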

Is it really “AI” traffic?

  • Access logs show explicit AI-related user agents (GPTBot, ClaudeBot, Meta's and Amazon's crawlers, PetalBot, and Chinese crawlers such as YisouSpider).
  • Some bots respect robots.txt if explicitly named, but often ignore wildcards; others ignore robots.txt entirely and spoof browser UAs.
  • Several commenters attribute more opaque botnets to AI training/RAG or dataset sellers; a minority speculate about cloud providers or generic scraping-for-resale.
  • Others argue the core pattern (sloppy, aggressive scrapers) is old; AI mainly increased demand and target value (code, blogs).
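The "respect robots.txt only if explicitly named" behavior suggests listing aggressive bots one by one rather than relying on a wildcard. A sketch, using bot names mentioned in the thread (the `/cgit/` path is an assumption):

```text
# Name AI crawlers individually; some honor only explicit entries
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PetalBot
Disallow: /

User-agent: YisouSpider
Disallow: /

# Wildcard fallback for the dynamic UI; reportedly often ignored
User-agent: *
Disallow: /cgit/
```

This only helps against bots that fetch and honor robots.txt at all; UA-spoofing crawlers ignore it entirely.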

Proposed defenses and tradeoffs

  • Hardening / restriction:
    • SSH-only git, VPN/WireGuard, HTTP basic auth, OAuth/Keycloak, Cloudflare Access; effective, but they remove or complicate public read-only access.
    • Blocking specific countries or ASNs (notably large Chinese networks, sometimes AWS), at the cost of excluding legitimate users.
  • Protocol- and app-level changes:
    • Static site generators to avoid dynamic load.
    • Git web UIs that expose only branch heads, or nginx rules that 404 commit pages.
  • Bot filtering:
    • Carefully tuned robots.txt naming AI bots individually; reported as effective by some.
    • Fail2ban / Crowdsec / nginx limit_req; works for concentrated abuse, but struggles against slow, massively distributed crawlers.
    • Challenge-based gates such as Anubis (a proof-of-work check), shibboleth cookies + JavaScript reloads, and "poison" responses suggested to frustrate or corrupt bad scrapers; these often rely on mandatory JS and may break no-JS users.
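The nginx-level ideas above (404ing commit pages, `limit_req`) can be combined in one config fragment. A sketch only: the `/cgit/` path layout is an assumption, and `limit_req_zone` belongs in the `http` context:

```text
# Per-IP token bucket: 2 req/s sustained, short bursts allowed
limit_req_zone $binary_remote_addr zone=gitweb:10m rate=2r/s;

server {
    # 404 the expensive per-commit views instead of rendering them
    location ~ ^/cgit/.+/(commit|diff|blame)/ {
        return 404;
    }

    location /cgit/ {
        limit_req zone=gitweb burst=10 nodelay;
        # proxy_pass to the git web UI here
    }
}
```

As noted above, per-IP limits like this mainly stop concentrated abuse; a crawler spread across millions of addresses stays under any per-address threshold.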
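The Anubis-style approach reduces to a proof-of-work challenge: the server issues a nonce, client-side JavaScript burns CPU to find an answer, and the server verifies cheaply. A minimal sketch of both sides, with a made-up difficulty parameter (real deployments tune this and bind the challenge to a signed cookie):

```python
import hashlib
import os

DIFFICULTY_BITS = 12  # assumption: low difficulty for illustration

def issue_challenge() -> str:
    """Server side: hand out a random nonce."""
    return os.urandom(8).hex()

def leading_zero_bits(digest: bytes) -> int:
    n = int.from_bytes(digest, "big")
    return len(digest) * 8 - n.bit_length()

def verify(challenge: str, answer: int) -> bool:
    """Server side: one cheap hash to check the client's work."""
    digest = hashlib.sha256(f"{challenge}:{answer}".encode()).digest()
    return leading_zero_bits(digest) >= DIFFICULTY_BITS

def solve(challenge: str) -> int:
    """Client side (normally JS in the browser): brute-force an answer."""
    answer = 0
    while not verify(challenge, answer):
        answer += 1
    return answer
```

The asymmetry is the point: each request costs the client ~2^12 hashes here but the server only one, which taxes mass crawlers while barely affecting a human's browser. The no-JS breakage mentioned above follows directly, since a client that cannot run the solver never gets through.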

Centralization vs self-hosting

  • Cloudflare and similar services are repeatedly suggested (including pay-per-crawl), but:
    • Some report they still see large bot volumes through Cloudflare, especially via residential proxies.
    • Others worry about extreme centralization of “last mile” web traffic and the erosion of practical self-hosting.
  • There is tension between using big-CDN protection and preserving the independence that motivated self-hosting in the first place.

Ethical, legal, and ecosystem concerns

  • Many view indiscriminate scraping as theft of labor and bandwidth, turning the open web into an “AI mine” and “DoS-as-a-service.”
  • Suggestions include charging per crawl and coordinated “data poisoning” responses, hoping to push AI companies to behave better.
  • Some note regulatory and geopolitical factors: weak current law around training-data scraping, AI arms races, and lobbying delaying stronger protections.