You cannot have our user's data

Public data vs. control

  • Some argue that once data is on the public web, you effectively lose control over its propagation; treating public content as non-public is seen as unrealistic.
  • Others push back that “public but not like that” is still meaningful: users can object to large-scale scraping, mass replication, and attention-diverting derivatives even if individual copying is inevitable.
  • Analogies are made to a public square or a restaurant mint bowl: public access doesn’t imply unlimited industrial-scale extraction.

Copyright, law, and jurisdiction

  • Many point to copyright (and the Berne Convention) as the legal mechanism for “public but controlled.”
  • Counterpoints: copyright mostly constrains redistribution, not private use; enforcement is hard across borders and with botnets.
  • Some stress that merely adding “no AI training” clauses or licenses is toothless unless someone actually litigates; others note the practical limits of suing actors in jurisdictions like Russia or China.

Resource abuse and crawlers

  • Broad agreement that blocking badly behaved crawlers (AI-related or not) is legitimate: they can generate huge bandwidth bills and denial-of-service conditions.
  • Some emphasize that LLM crawlers aren’t “the public” when they effectively crowd out human users by saturating bandwidth.
  • There’s frustration that scrapers repeatedly hit mostly static sites with no apparent benefit.

Host neutrality vs. anti-AI stance

  • One side welcomes SourceHut’s explicit ban on ML training use, seeing it as defending users and infrastructure from exploitative “Big Tech.”
  • Another side, including maintainers of permissively licensed projects, dislikes hosts imposing blanket anti-ML terms on code they don’t own; they want maximum visibility, including via LLMs.
  • Debate over whether hosts should strive for maximal neutrality vs. aligning with particular ethical/political positions.

Cloudflare and “racketeering” framing

  • Some suggest the “racketeer” label refers to Cloudflare both selling AI services and selling protection against AI scrapers, and similarly offering DDoS protection while fronting for DDoS-for-hire sites.
  • Others recall high pricing when SourceHut sought Cloudflare help and cite criticism that Cloudflare benefits from widespread attacks.

Licenses and LLMs

  • People speculate about licenses that would force trained models to be open and outputs to be open source; several think such clauses would be unenforceable or struck down as unfair.
  • Whether LLM training is (or will be held to be) “fair use,” whether models are derivative works, and whether model outputs are copyrightable all remain unresolved; commenters repeatedly label the area legally unclear.
  • Some argue GPL contamination might already apply to many models; others note courts seem to demand proof of concrete damages.

Anubis proof-of-work and browser issues

  • Anubis, used by SourceHut, runs a multi-threaded proof-of-work challenge in the browser to distinguish humans from bots.
  • Critics say this becomes de facto gatekeeping against older or nonstandard browsers and contradicts SourceHut’s “no JavaScript required” positioning.
  • Defenders argue any modern browser can implement it and that UA checks are primarily an optimization; proof-of-work is the real barrier for large-scale scraping, not genuine users.
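The defenders' point can be illustrated with a minimal hash-based proof-of-work sketch (plain Python, not Anubis's actual code; the SHA-256 scheme and difficulty parameter here are assumptions for illustration). Each extra difficulty bit doubles the expected work for the solver, while verification stays a single hash:

```python
import hashlib
import itertools

def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Check that sha256(challenge || nonce) has `difficulty` leading zero bits."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
    value = int.from_bytes(digest, "big")
    return value >> (256 - difficulty) == 0

def solve(challenge: str, difficulty: int) -> int:
    """Brute-force a nonce; expected cost grows as 2**difficulty."""
    for nonce in itertools.count():
        if verify(challenge, nonce, difficulty):
            return nonce
```

The asymmetry is the point: one interactive visitor pays a fraction of a second once, but a scraper fetching millions of pages pays that cost millions of times, which is why the work itself, rather than any UA check, is the barrier at scale.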

Miscellaneous ideas and concerns

  • Suggestions include scraper tarpits that feed crawlers infinite “poison” training data, or redirecting the proof-of-work into cryptocurrency mining that benefits site owners.
  • Some wonder how sure anyone is that specific heavy traffic is from LLM scrapers vs. plain DDoS with plausible deniability.
  • A few promote more distributed, self-contained VCS systems (e.g., Fossil-like) as a structural response to centralized scraping and hosting constraints.
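The tarpit idea mentioned above can be sketched in a few lines: serve deterministic junk pages whose links all point deeper into the tarpit, so a crawler's frontier grows without bound while costing the host almost nothing. This is a toy illustration (the URL scheme, word list, and branching factor are all invented for the example):

```python
import random

def tarpit_page(seed: int, width: int = 5) -> str:
    """Generate a deterministic junk page; every link leads to another tarpit page.

    Seeding the RNG from the URL keeps pages reproducible without storing state."""
    rng = random.Random(seed)
    words = ["data", "model", "train", "crawl", "token", "cache"]
    body = " ".join(rng.choice(words) for _ in range(50))
    links = "".join(
        f'<a href="/tarpit/{seed * width + i}">more</a>' for i in range(1, width + 1)
    )
    return f"<html><body><p>{body}</p>{links}</body></html>"
```

Because each page links to `width` children, the reachable page count grows geometrically, and the generated text is exactly the kind of low-value “poison” data commenters hoped would discourage indiscriminate training crawls.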