FOSS infrastructure is under attack by AI companies

Technical Countermeasures Against AI Scrapers

  • Ideas range from simple blocking to active punishment:
    • IP / ASN blocking (especially Alibaba and other cloud providers), rate limiting, fail2ban, CAPTCHAs, login walls.
    • Proof‑of‑work (PoW) gates such as Anubis: make each request computationally expensive for the client while staying cheap for the server to verify (a minimal sketch follows this list).
    • Tarpits and slowloris‑style throttling: trickle responses to waste bot time without consuming much server work.
    • Honeypots and “AI tarpits” such as Cloudflare’s AI Labyrinth: hidden links or paths that only bots follow, leading to infinite or Markov‑generated junk (sketched below, after the PoW example).
  • People worry about:
    • Collateral damage to real users (slow pages, extra friction, accessibility issues).
    • Arms race dynamics: once techniques become common, bots will adapt (headless Chrome, GPUs/ASICs for PoW, residential proxies).
    • Legal risk of “sabotage” (zip bombs, poisoning) under computer misuse laws.
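
As a concrete illustration of the PoW approach, here is a minimal sketch (the challenge format, difficulty, and function names are invented for illustration and are not Anubis’s actual protocol): the client brute‑forces a nonce until the hash of challenge‑plus‑nonce has enough leading zero bits, and the server checks the result with a single hash.

```python
import hashlib
import secrets

DIFFICULTY_BITS = 20  # ~2**20 (about a million) hashes for the client on average

def leading_zero_bits(digest: bytes) -> int:
    """Count the leading zero bits of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
        else:
            bits += 8 - byte.bit_length()
            break
    return bits

def issue_challenge() -> str:
    """Server side: hand out a random challenge (a real gate would sign or store it)."""
    return secrets.token_hex(16)

def solve(challenge: str) -> int:
    """Client side: brute-force a nonce -- this is the expensive part."""
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= DIFFICULTY_BITS:
            return nonce
        nonce += 1

def verify(challenge: str, nonce: int) -> bool:
    """Server side: a single hash, so verification stays cheap."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= DIFFICULTY_BITS

if __name__ == "__main__":
    ch = issue_challenge()
    n = solve(ch)         # expensive for the client
    print(verify(ch, n))  # one hash for the server -> True
```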
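
And a sketch of the honeypot/tarpit idea, using Flask purely for brevity (the hidden path, timing, and vocabulary are invented for illustration): a page that only a bot would follow streams endless word salad a few bytes at a time.

```python
import random
import time

from flask import Flask, Response

app = Flask(__name__)

# Tiny vocabulary for the demo; a real deployment might feed a Markov chain
# trained on the site's own text so the junk looks plausible to a crawler.
WORDS = "the of a to and in git commit branch merge patch kernel driver".split()

def junk_stream():
    """Endless word salad, trickled out one short line every few seconds."""
    while True:
        line = " ".join(random.choices(WORDS, k=8))
        yield (line + "\n").encode()
        time.sleep(3)  # slowloris-style: keep the connection open, serve almost nothing

# This path is never linked for humans: it would be referenced only from markup
# hidden via CSS and disallowed in robots.txt, so polite clients never reach it.
@app.route("/.well-hidden/archive/")
def tarpit():
    return Response(junk_stream(), mimetype="text/plain")

if __name__ == "__main__":
    app.run(port=8080)
```

Since a synchronous server ties up one worker per trapped connection, real deployments tend to do this in an async server or at the reverse proxy, so the bot’s wait costs the host almost nothing.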

Impact on FOSS and Small Infrastructure

  • Multiple operators report:
    • Crawlers hammering expensive git endpoints (blame, per‑commit views, archive downloads), often via the web UI instead of git clone (a throttling sketch follows this list).
    • Ignoring robots.txt, HTTP 429/503, and cache headers; faking or randomizing user agents; using thousands of IPs, often residential or cloud.
    • Massive bandwidth bills on commercial clouds (tens of TB of egress; at typical cloud rates of roughly $0.09/GB, 30 TB alone is about $2,700) and disk exhaustion from generated archives.
  • Some see this as a de facto DDoS and call for treating it legally as such.
  • Others say it exposes fragile web apps (heavy SPAs, poor caching, inefficient git frontends); the counter‑argument is that even well‑engineered sites cannot economically absorb abusive crawling.
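
As a rough sketch of shielding only the costly endpoints (path patterns, limits, and the WSGI wrapping are illustrative, not any particular forge’s mechanism): a per‑IP token bucket that leaves ordinary pages untouched but answers rapid‑fire blame/commit/archive requests with 429.

```python
import re
import time
from collections import defaultdict

# Illustrative patterns for costly git web UI views; adjust to your frontend.
EXPENSIVE = re.compile(r"/(blame|commit|archive|snapshot)/")

RATE = 0.2   # tokens per second: one expensive request every 5 s per IP
BURST = 5.0  # small burst allowance before throttling kicks in

class GitThrottle:
    """WSGI middleware: per-IP token bucket applied only to expensive paths."""

    def __init__(self, app):
        self.app = app
        self.buckets = defaultdict(lambda: (BURST, time.monotonic()))

    def allow(self, ip):
        tokens, last = self.buckets[ip]
        now = time.monotonic()
        tokens = min(BURST, tokens + (now - last) * RATE)
        allowed = tokens >= 1.0
        self.buckets[ip] = (tokens - 1.0 if allowed else tokens, now)
        return allowed

    def __call__(self, environ, start_response):
        path = environ.get("PATH_INFO", "")
        ip = environ.get("REMOTE_ADDR", "unknown")
        if EXPENSIVE.search(path) and not self.allow(ip):
            start_response("429 Too Many Requests",
                           [("Retry-After", "30"), ("Content-Type", "text/plain")])
            return [b"Throttled: expensive git endpoints are rate limited.\n"]
        return self.app(environ, start_response)

# Hypothetical usage: wrap whatever WSGI app serves the repository browser.
# application = GitThrottle(repo_browser_app)
```

A real deployment would also evict idle buckets and take the client address from X‑Forwarded‑For when running behind a reverse proxy.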

Legal, Licensing, and Economic Debates

  • Dispute over whether training is “fair use” when the explicit goal is to compete with the original authors.
  • Concerns that opaque models trained on GPL/FOSS code undermine copyleft; proposals for “no AI training” clauses, but these conflict with existing open‑source definitions and are likely to be ignored by bad actors.
  • Suggestions: lawsuits (for copyright infringement or the DDoS‑like load itself), terms‑of‑service traps, and collective assignment of rights to enforcement organizations; skepticism centers on cost, legal uncertainty, and the power imbalance.

Future of the Web and Governance

  • Many expect:
    • More content behind auth, payment, or verified identity; decline of anonymous access; stronger bot detection at CDNs.
    • Whitelisting of a few “trusted” crawlers (Google, Bing) and de facto exclusion of new entrants (a verification sketch closes this section).
    • Further centralization (Cloudflare, the big search/AI players) and a possible move toward browser attestation.
  • Philosophical split:
    • One side: “If it’s public, expect it to be scraped; design accordingly.”
    • The other side: AI firms are consciously externalizing costs, eroding the open web and FOSS goodwill, and pushing toward a feudal, enclosure‑style Internet.
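
For what crawler whitelisting tends to look like in practice, here is a sketch of forward‑confirmed reverse DNS, the verification method the major search engines document for their bots (the function name is illustrative, the domain suffixes are the published ones, and the check is IPv4‑only for brevity):

```python
import socket

# Reverse-DNS suffixes the major search crawlers publish for their bots.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")

def is_verified_crawler(ip: str) -> bool:
    """Forward-confirmed reverse DNS: the IP's rDNS must land on a trusted
    domain, and that hostname must resolve back to the same IP, so a faked
    user-agent string alone proves nothing."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)        # reverse lookup
    except OSError:
        return False
    if not host.endswith(TRUSTED_SUFFIXES):
        return False
    try:
        _, _, addrs = socket.gethostbyname_ex(host)  # forward confirmation (IPv4)
    except OSError:
        return False
    return ip in addrs

# Anything claiming a trusted crawler's user agent but failing this check can
# be treated like any other anonymous scraper.
```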