FOSS infrastructure is under attack by AI companies

Technical Countermeasures Against AI Scrapers

  • Ideas range from simple blocking to active punishment:
    • IP / ASN blocking (especially Alibaba and other cloud providers), rate limiting, fail2ban, CAPTCHAs, login walls.
    • Proof‑of‑work (PoW) gates such as Anubis: make each request computationally expensive for the client while staying cheap for the server to verify (a minimal sketch follows this list).
    • Tarpits and slowloris‑style throttling: trickle responses to waste bot time without consuming much server work.
    • Honeypots and “AI tarpits” such as Cloudflare’s AI Labyrinth: hidden links or paths that only bots follow, leading to infinite or Markov‑generated junk (sketched below, after the PoW example).
  • People worry about:
    • Collateral damage to real users (slow pages, extra friction, accessibility issues).
    • Arms race dynamics: once techniques become common, bots will adapt (headless Chrome, GPUs/ASICs for PoW, residential proxies).
    • Legal risk of “sabotage” (zip bombs, poisoning) under computer misuse laws.
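
As a concrete illustration of the PoW approach, here is a minimal sketch (the challenge format, difficulty, and function names are invented for illustration and are not Anubis’s actual protocol): the client brute‑forces a nonce until the hash of challenge‑plus‑nonce has enough leading zero bits, and the server checks the result with a single hash.

```python
import hashlib
import secrets

DIFFICULTY_BITS = 20  # ~2**20 (about a million) hashes for the client on average

def leading_zero_bits(digest: bytes) -> int:
    """Count the leading zero bits of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
        else:
            bits += 8 - byte.bit_length()
            break
    return bits

def issue_challenge() -> str:
    """Server side: hand out a random challenge (a real gate would sign or store it)."""
    return secrets.token_hex(16)

def solve(challenge: str) -> int:
    """Client side: brute-force a nonce -- this is the expensive part."""
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= DIFFICULTY_BITS:
            return nonce
        nonce += 1

def verify(challenge: str, nonce: int) -> bool:
    """Server side: a single hash, so verification stays cheap."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= DIFFICULTY_BITS

if __name__ == "__main__":
    ch = issue_challenge()
    n = solve(ch)         # expensive for the client
    print(verify(ch, n))  # one hash for the server -> True
```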
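
And a sketch of the honeypot/tarpit idea, using Flask purely for brevity (the hidden path, timing, and vocabulary are invented for illustration): a page that only a bot would follow streams endless word salad a few bytes at a time.

```python
import random
import time

from flask import Flask, Response

app = Flask(__name__)

# Tiny vocabulary for the demo; a real deployment might feed a Markov chain
# trained on the site's own text so the junk looks plausible to a crawler.
WORDS = "the of a to and in git commit branch merge patch kernel driver".split()

def junk_stream():
    """Endless word salad, trickled out one short line every few seconds."""
    while True:
        line = " ".join(random.choices(WORDS, k=8))
        yield (line + "\n").encode()
        time.sleep(3)  # slowloris-style: keep the connection open, serve almost nothing

# This path is never linked for humans: it would be referenced only from markup
# hidden via CSS and disallowed in robots.txt, so polite clients never reach it.
@app.route("/.well-hidden/archive/")
def tarpit():
    return Response(junk_stream(), mimetype="text/plain")

if __name__ == "__main__":
    app.run(port=8080)
```

Since a synchronous server ties up one worker per trapped connection, real deployments tend to do this in an async server or at the reverse proxy, so the bot’s wait costs the host almost nothing.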

Impact on FOSS and Small Infrastructure

  • Multiple operators report:
    • Crawlers hammering expensive git endpoints (blame, per‑commit views, archive downloads), often via the web UI instead of git clone (a throttling sketch follows this list).
    • Ignoring robots.txt, HTTP 429/503, and cache headers; faking or randomizing user agents; using thousands of IPs, often residential or cloud.
    • Massive bandwidth bills on commercial clouds (tens of TB of egress; at typical cloud rates of roughly $0.09/GB, 30 TB alone is about $2,700) and disk exhaustion from generated archives.
  • Some see this as a de facto DDoS and call for treating it legally as such.
  • Others say it exposes fragile web apps (heavy SPAs, poor caching, inefficient git frontends); the counter‑argument is that even well‑engineered sites cannot economically absorb abusive crawling.
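
As a rough sketch of shielding only the costly endpoints (path patterns, limits, and the WSGI wrapping are illustrative, not any particular forge’s mechanism): a per‑IP token bucket that leaves ordinary pages untouched but answers rapid‑fire blame/commit/archive requests with 429.

```python
import re
import time
from collections import defaultdict

# Illustrative patterns for costly git web UI views; adjust to your frontend.
EXPENSIVE = re.compile(r"/(blame|commit|archive|snapshot)/")

RATE = 0.2   # tokens per second: one expensive request every 5 s per IP
BURST = 5.0  # small burst allowance before throttling kicks in

class GitThrottle:
    """WSGI middleware: per-IP token bucket applied only to expensive paths."""

    def __init__(self, app):
        self.app = app
        self.buckets = defaultdict(lambda: (BURST, time.monotonic()))

    def allow(self, ip):
        tokens, last = self.buckets[ip]
        now = time.monotonic()
        tokens = min(BURST, tokens + (now - last) * RATE)
        allowed = tokens >= 1.0
        self.buckets[ip] = (tokens - 1.0 if allowed else tokens, now)
        return allowed

    def __call__(self, environ, start_response):
        path = environ.get("PATH_INFO", "")
        ip = environ.get("REMOTE_ADDR", "unknown")
        if EXPENSIVE.search(path) and not self.allow(ip):
            start_response("429 Too Many Requests",
                           [("Retry-After", "30"), ("Content-Type", "text/plain")])
            return [b"Throttled: expensive git endpoints are rate limited.\n"]
        return self.app(environ, start_response)

# Hypothetical usage: wrap whatever WSGI app serves the repository browser.
# application = GitThrottle(repo_browser_app)
```

A real deployment would also evict idle buckets and take the client address from X‑Forwarded‑For when running behind a reverse proxy.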

Legal, Licensing, and Economic Debates

  • Dispute over whether training is “fair use” when the explicit goal is to compete with the original authors.
  • Concerns that opaque models trained on GPL/FOSS code undermine copyleft; proposals for “no AI training” clauses, but these conflict with existing open‑source definitions and are likely to be ignored by bad actors.
  • Suggestions: lawsuits (for copyright infringement or the DDoS‑like load itself), terms‑of‑service traps, and collective assignment of rights to enforcement organizations; skepticism centers on cost, legal uncertainty, and the power imbalance.

Future of the Web and Governance

  • Many expect:
    • More content behind auth, payment, or verified identity; decline of anonymous access; stronger bot detection at CDNs.
    • Whitelisting of a few “trusted” crawlers (Google, Bing) and de facto exclusion of new entrants (a verification sketch closes this section).
    • Further centralization (Cloudflare, the big search/AI players) and a possible move toward browser attestation.
  • Philosophical split:
    • One side: “If it’s public, expect it to be scraped; design accordingly.”
    • The other side: AI firms are consciously externalizing costs, eroding the open web and FOSS goodwill, and pushing toward a feudal, enclosure‑style Internet.
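
For what crawler whitelisting tends to look like in practice, here is a sketch of forward‑confirmed reverse DNS, the verification method the major search engines document for their bots (the function name is illustrative, the domain suffixes are the published ones, and the check is IPv4‑only for brevity):

```python
import socket

# Reverse-DNS suffixes the major search crawlers publish for their bots.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")

def is_verified_crawler(ip: str) -> bool:
    """Forward-confirmed reverse DNS: the IP's rDNS must land on a trusted
    domain, and that hostname must resolve back to the same IP, so a faked
    user-agent string alone proves nothing."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)        # reverse lookup
    except OSError:
        return False
    if not host.endswith(TRUSTED_SUFFIXES):
        return False
    try:
        _, _, addrs = socket.gethostbyname_ex(host)  # forward confirmation (IPv4)
    except OSError:
        return False
    return ip in addrs

# Anything claiming a trusted crawler's user agent but failing this check can
# be treated like any other anonymous scraper.
```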