FOSS infrastructure is under attack by AI companies
Technical Countermeasures Against AI Scrapers
- Ideas range from simple blocking to active punishment:
- IP / ASN blocking (especially Alibaba, cloud providers), rate limiting, fail2ban, CAPTCHAs, loginwalls.
- Proof‑of‑work (PoW) gates such as Anubis: make each request computationally expensive for the client to issue while keeping it cheap for the server to verify.
- Tarpits and slowloris‑style throttling: trickle responses to waste bot time without consuming much server work.
- Honeypots and “AI tarpits” (e.g., Cloudflare’s “AI Labyrinth”): hidden links or paths that only bots follow, leading to infinite or Markov‑generated junk pages.
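The PoW idea can be sketched in a few lines: the server issues a challenge, the client must brute-force a nonce whose hash clears a difficulty target (expensive search), and the server checks the answer with a single hash (cheap verification). This is an illustrative toy, not Anubis’s actual scheme; the challenge string, difficulty value, and choice of SHA‑256 are all assumptions.

```python
import hashlib
import itertools

DIFFICULTY = 12  # leading zero bits required; real deployments tune this per client

def meets_target(challenge: str, nonce: int, bits: int = DIFFICULTY) -> bool:
    """Cheap server-side verification: one hash, then check the leading bits."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    value = int.from_bytes(digest, "big")
    return value >> (256 - bits) == 0

def solve(challenge: str, bits: int = DIFFICULTY) -> int:
    """Expensive client-side search: try nonces until the target is met."""
    for nonce in itertools.count():
        if meets_target(challenge, nonce, bits):
            return nonce

# With 12 bits of difficulty the client does ~4096 hashes on average;
# the server verifies with exactly one.
nonce = solve("GET /repo/blame/abc123")
assert meets_target("GET /repo/blame/abc123", nonce)
```

The asymmetry is the whole point: a human’s browser pays the cost once per session, while a crawler issuing millions of requests pays it millions of times.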
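A tarpit and a Markov honeypot combine naturally: once a bot follows a hidden link, the handler streams endless plausible‑looking text, pausing between chunks so the connection is held open at near‑zero server cost. Everything below (the word list, the degenerate bigram table, the chunk count) is an illustrative stand‑in, not any particular tool’s implementation.

```python
import random
import time

# Illustrative "corpus": a real tarpit would train a Markov chain on actual text.
WORDS = "the scraper followed a link into an endless corridor of text".split()
CHAIN = {w: WORDS for w in WORDS}  # degenerate bigram table: any word may follow any word

def junk_stream(seed_word: str, chunks: int, delay: float = 0.0):
    """Yield Markov-ish word chunks with a pause between each one."""
    word = seed_word
    for _ in range(chunks):
        word = random.choice(CHAIN[word])
        yield word + " "
        time.sleep(delay)  # the tarpit: trickle bytes to waste the bot's time

# In a real handler, delay would be seconds per chunk; 0.0 keeps this demo instant.
page = "".join(junk_stream("the", chunks=50))
```

Served behind a link excluded in robots.txt and invisible to humans, output like this wastes crawler time and pollutes training data without consuming meaningful server resources.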
- People worry about:
- Collateral damage to real users (slow pages, extra friction, accessibility issues).
- Arms race dynamics: once techniques become common, bots will adapt (headless Chrome, GPUs/ASICs for PoW, residential proxies).
- Legal risk of “sabotage” (zip bombs, poisoning) under computer misuse laws.
Impact on FOSS and Small Infrastructure
- Multiple operators report:
- Crawlers hammering expensive git endpoints (blame, per‑commit views, archive downloads), often via web UIs instead of git clone.
- Ignoring robots.txt, HTTP 429/503, and cache headers; faking or randomizing user agents; using thousands of IPs, often residential or cloud.
- Massive bandwidth bills on commercial clouds (e.g., tens of TB costing thousands of dollars) and disk exhaustion from generated archives.
- Some see this as de‑facto DDoS and call for treating it legally as such.
- Others say it exposes fragile web apps (heavy SPAs, poor caching, inefficient git frontends); the counter‑argument is that even well‑engineered sites can’t economically absorb abusive crawling.
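Rate limiting at this layer is commonly a per‑client token bucket: each IP gets a small burst allowance that refills slowly, so interactive users are untouched while sustained crawling of blame or archive endpoints hits 429s. A minimal sketch, where the rate, burst size, and IP are arbitrary illustrative values:

```python
import time
from collections import defaultdict

class TokenBucket:
    """Per-client token bucket: allow short bursts, throttle sustained crawling."""

    def __init__(self, rate: float, burst: int):
        self.rate = rate    # tokens refilled per second
        self.burst = burst  # maximum bucket size
        self.tokens = defaultdict(lambda: float(burst))
        self.last = defaultdict(time.monotonic)

    def allow(self, client_ip: str) -> bool:
        now = time.monotonic()
        elapsed = now - self.last[client_ip]
        self.last[client_ip] = now
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens[client_ip] = min(self.burst, self.tokens[client_ip] + elapsed * self.rate)
        if self.tokens[client_ip] >= 1.0:
            self.tokens[client_ip] -= 1.0
            return True
        return False  # caller would respond 429 Too Many Requests

limiter = TokenBucket(rate=1.0, burst=5)
results = [limiter.allow("203.0.113.9") for _ in range(10)]
# A rapid burst exhausts the bucket: the first 5 requests pass, the rest are throttled.
```

In practice this state lives in the reverse proxy (e.g., nginx’s limit_req) rather than the application, and it is keyed per network rather than per IP when crawlers rotate through thousands of addresses.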
Legal, Licensing, and Economic Debates
- Dispute over whether training is “fair use” when the explicit goal is to compete with original authors.
- Concerns that opaque models trained on GPL/FOSS code undermine copyleft; proposals for “no AI training” clauses, but these conflict with existing open‑source definitions and are likely to be ignored by bad actors.
- Suggestions: lawsuits (copyright or DDoS), terms-of-service traps, collective rights assignment to enforcement orgs; skepticism about cost, uncertainty, and power imbalance.
Future of the Web and Governance
- Many expect:
- More content behind auth, payment, or verified identity; decline of anonymous access; stronger bot detection at CDNs.
- Whitelisting of a few “trusted” crawlers (Google, Bing) and de‑facto exclusion of new entrants.
- Further centralization (Cloudflare, big search/AI) and possible move toward browser attestation.
- Philosophical split:
- One side: “If it’s public, expect it to be scraped; design accordingly.”
- Other side: sees AI firms as consciously externalizing costs, eroding the open web and FOSS goodwill, and pushing toward a feudal, enclosure‑style Internet.