You cannot have our user's data

Public data vs. control

  • Some argue that once data is on the public web, you effectively lose control over its propagation; treating public content as non-public is seen as unrealistic.
  • Others push back that “public but not like that” is still meaningful: users can object to large-scale scraping, mass replication, and attention-diverting derivatives even if individual copying is inevitable.
  • Analogies are made to a public square or a restaurant mint bowl: public access doesn’t imply unlimited industrial-scale extraction.

Copyright, law, and jurisdiction

  • Many point to copyright (and the Berne Convention) as the legal mechanism for “public but controlled.”
  • Counterpoints: copyright mostly constrains redistribution, not private use; enforcement is hard across borders and with botnets.
  • Some stress that merely adding “no AI training” clauses or licenses is toothless unless someone actually litigates; others note the practical limits of suing actors in jurisdictions like Russia or China.

Resource abuse and crawlers

  • Broad agreement that blocking badly behaved crawlers (AI-related or not) is legitimate: they can generate huge bandwidth bills and denial-of-service conditions.
  • Some emphasize that LLM crawlers aren’t “the public” when they effectively crowd out human users by saturating bandwidth.
  • There’s frustration that scrapers repeatedly hit mostly static sites with no apparent benefit.

Host neutrality vs. anti-AI stance

  • One side welcomes SourceHut’s explicit ban on ML training use, seeing it as defending users and infrastructure from exploitative “Big Tech.”
  • Another side, including maintainers of permissively licensed projects, dislikes hosts imposing blanket anti-ML terms on code they don’t own; they want maximum visibility, including via LLMs.
  • Debate over whether hosts should strive for maximal neutrality vs. aligning with particular ethical/political positions.

Cloudflare and “racketeering” framing

  • Some suggest the “racketeer” label refers to Cloudflare both selling AI services and selling protection against AI scrapers, and similarly offering DDoS protection while fronting for DDoS-for-hire sites.
  • Others recall high pricing when SourceHut sought Cloudflare help and cite criticism that Cloudflare benefits from widespread attacks.

Licenses and LLMs

  • People speculate about licenses that would force trained models to be open and outputs to be open source; several think such clauses would be unenforceable or struck down as unfair.
  • Whether LLM training is (or will be held to be) “fair use,” whether models are derivative works, and whether model outputs are copyrightable all remain unresolved; commenters repeatedly label the area legally unclear.
  • Some argue GPL contamination might already apply to many models; others note courts seem to demand proof of concrete damages.

Anubis proof-of-work and browser issues

  • Anubis, used by SourceHut, runs a multi-threaded proof-of-work challenge in the browser to distinguish humans from bots.
  • Critics say this becomes de facto gatekeeping against older or nonstandard browsers and contradicts SourceHut’s “no JavaScript required” positioning.
  • Defenders argue any modern browser can implement it and that UA checks are primarily an optimization; proof-of-work is the real barrier for large-scale scraping, not genuine users.
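The defenders' point can be illustrated with a minimal hash-based proof-of-work sketch (plain Python, not Anubis's actual code; the SHA-256 scheme and difficulty parameter here are assumptions for illustration). Each extra difficulty bit doubles the expected work for the solver, while verification stays a single hash:

```python
import hashlib
import itertools

def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Check that sha256(challenge || nonce) has `difficulty` leading zero bits."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
    value = int.from_bytes(digest, "big")
    return value >> (256 - difficulty) == 0

def solve(challenge: str, difficulty: int) -> int:
    """Brute-force a nonce; expected cost grows as 2**difficulty."""
    for nonce in itertools.count():
        if verify(challenge, nonce, difficulty):
            return nonce
```

The asymmetry is the point: one interactive visitor pays a fraction of a second once, but a scraper fetching millions of pages pays that cost millions of times, which is why the work itself, rather than any UA check, is the barrier at scale.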

Miscellaneous ideas and concerns

  • Suggestions include scraper tarpits that feed crawlers infinite “poison” training data, or redirecting the proof-of-work into cryptocurrency mining that benefits site owners.
  • Some wonder how sure anyone is that specific heavy traffic is from LLM scrapers vs. plain DDoS with plausible deniability.
  • A few promote more distributed, self-contained VCS systems (e.g., Fossil-like) as a structural response to centralized scraping and hosting constraints.
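The tarpit idea mentioned above can be sketched in a few lines: serve deterministic junk pages whose links all point deeper into the tarpit, so a crawler's frontier grows without bound while costing the host almost nothing. This is a toy illustration (the URL scheme, word list, and branching factor are all invented for the example):

```python
import random

def tarpit_page(seed: int, width: int = 5) -> str:
    """Generate a deterministic junk page; every link leads to another tarpit page.

    Seeding the RNG from the URL keeps pages reproducible without storing state."""
    rng = random.Random(seed)
    words = ["data", "model", "train", "crawl", "token", "cache"]
    body = " ".join(rng.choice(words) for _ in range(50))
    links = "".join(
        f'<a href="/tarpit/{seed * width + i}">more</a>' for i in range(1, width + 1)
    )
    return f"<html><body><p>{body}</p>{links}</body></html>"
```

Because each page links to `width` children, the reachable page count grows geometrically, and the generated text is exactly the kind of low-value “poison” data commenters hoped would discourage indiscriminate training crawls.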