Authors seek Meta's torrent client logs and seeding data in AI piracy probe
Allegations and BitTorrent Focus
- Core discussion centers on claims that Meta used BitTorrent/LibGen to obtain large book corpora, and that plaintiffs now want torrent logs to prove seeding (redistribution).
- Several comments stress that seeding/distribution is legally much more serious than mere downloading or training.
- Others note that BitTorrent clients can be configured not to upload, but plaintiffs in a civil case may benefit from evidentiary presumptions unless Meta can show otherwise (e.g., client configuration, testimony); a client-side sketch follows below.
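As a purely illustrative sketch of what "download without seeding" can look like, here is a minimal loop assuming the libtorrent Python bindings: cap the upload rate and remove the torrent the instant it completes. The torrent filename and paths are placeholders, and nothing here reflects Meta's actual configuration.

```python
import time

import libtorrent as lt  # assumption: libtorrent's Python bindings are installed

ses = lt.session()
# Throttle outbound payload to ~1 byte/s. In libtorrent a rate limit of 0
# means *unlimited*, so a tiny positive cap is used instead.
ses.apply_settings({"upload_rate_limit": 1})

# "example.torrent" and "./downloads" are hypothetical placeholders.
info = lt.torrent_info("example.torrent")
handle = ses.add_torrent({"ti": info, "save_path": "./downloads"})

while True:
    status = handle.status()
    if status.is_seeding:            # download finished; seeding would start now
        ses.remove_torrent(handle)   # drop the torrent before it serves pieces
        break
    time.sleep(1)
```

Even so, standard clients trade small amounts of piece data with peers while downloading, so "zero upload" is an approximation; whether any such configuration was in place is exactly what the requested logs and settings would show.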
Fair Use, Copyright, and LLM Training
- Ongoing debate over whether using copyrighted texts to train LLMs is "fair use" or an infringement/derivative work.
- Some argue training is like lossy compression or statistical analysis; the model holds abstractions, not full works (the analogy is made precise after this list).
- Others point to clear memorization/regurgitation examples as evidence of infringement (a simple extraction test is sketched after this list).
- A distinction is drawn between:
  - copying works for training,
  - model generation (possibly derivative), and
  - user actions (who actually uses or republishes outputs).
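On the lossy-compression analogy: standard information theory (background, not a claim from the thread) makes it concrete. An arithmetic coder driven by a model q can encode a text x in about -log2 q(x) bits, so the usual cross-entropy training objective is literally the expected compressed size of the corpus:

```latex
\[
  \mathcal{L}(q) \;=\; -\,\mathbb{E}_{x \sim p_{\text{data}}}\bigl[\log_2 q(x)\bigr]
  \qquad \text{(training loss = expected code length, in bits)}
\]
```

This is the sense in which some commenters treat the weights as a lossy statistical summary of the corpus rather than a copy of it.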
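On the memorization side, here is a minimal sketch of the kind of extraction test such examples rest on, assuming the Hugging Face transformers library; the model name and passage are stand-ins, not findings from the case:

```python
import difflib

from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-ins: a real test would use the model under scrutiny and a passage
# from an allegedly ingested book.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prefix = "It is a truth universally acknowledged, that a single man"
truth = " in possession of a good fortune, must be in want of a wife."

ids = tok(prefix, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=20, do_sample=False)  # greedy decoding
continuation = tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True)

# Near-verbatim reproduction of the true continuation suggests memorization,
# not mere abstraction.
score = difflib.SequenceMatcher(None, continuation, truth).ratio()
print(f"{continuation!r}\nsimilarity to source text: {score:.2f}")
```

Verbatim continuation under greedy decoding is the usual smoking gun in published regurgitation demos; paraphrased output is far harder to classify.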
Piracy vs Legal Acquisition of Data
- One side claims large-scale piracy is practically inevitable because buying/licensing millions of books is “too hard.”
- Pushback is strong: big tech firms have enormous resources and could bulk-license, buy publishers, or negotiate new rights instead of pirating.
- Some point to existing infrastructure such as Google Books and ebook stores as potential legal sources.
- Others argue that copyright doesn’t clearly grant authors a specific “no LLM training” right, so even licensed ebook sales may not resolve the issue.
Technical Approaches and Workarounds
- Mention of synthetic data, summaries, QA pairs, and federated learning as ways to reduce direct exposure to copyrighted text (a federated-averaging sketch follows this list).
- Concerns about "knowledge collapse" if models are trained only on synthetic data (see the toy simulation after this list).
- Some see hybrid organic/synthetic training and filtered corpora as the emerging norm.
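The federated-learning idea mentioned above, sketched minimally in numpy: hypothetical clients (e.g., rights holders) train locally on private data and share only model weights, which a server averages (plain FedAvg). Everything here is a toy stand-in:

```python
import numpy as np

rng = np.random.default_rng(0)

def local_step(w, X, y, lr=0.1, epochs=5):
    """One client's local training: a few gradient steps on private data."""
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)  # least-squares gradient
        w = w - lr * grad
    return w

# Hypothetical clients; each keeps its (X, y) data private.
true_w = np.array([1.0, -2.0, 0.5])
clients = []
for _ in range(4):
    X = rng.normal(size=(50, 3))
    y = X @ true_w + 0.1 * rng.normal(size=50)
    clients.append((X, y))

w = np.zeros(3)
for _ in range(20):  # communication rounds
    updates = [local_step(w.copy(), X, y) for X, y in clients]
    w = np.mean(updates, axis=0)  # FedAvg: the server only averages weights

print(w)  # converges near true_w without raw data ever leaving a client
```

The design point is that only weights cross the trust boundary, which is why commenters raise it as a way to train on text nobody is willing to hand over wholesale.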
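The "knowledge collapse" concern is a cousin of the well-documented model-collapse effect, which a toy recursion illustrates: fit a distribution, sample from the fit, refit on the samples. With small per-generation samples, the fitted spread tends to drift toward zero (exact numbers depend on the seed):

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0 trains on "organic" data from the true distribution N(0, 1).
data = rng.normal(0.0, 1.0, size=20)

for gen in range(1, 101):
    mu, sigma = data.mean(), data.std()
    # Each later generation trains only on the previous model's synthetic output.
    data = rng.normal(mu, sigma, size=20)
    if gen % 20 == 0:
        print(f"gen {gen:3d}: fitted sigma = {sigma:.3f}")  # spread tends to shrink
```

Hybrid organic/synthetic training, as noted above, is in part an attempt to re-anchor each generation in real data so this drift does not compound.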
Ethical and Societal Themes
- Strong criticism of “move fast and break things” and perceived hypocrisy: tech firms restricting training on their outputs while freely training on others’ work.
- Hypotheticals about extreme data collection (e.g., scanning everyone’s brain for AGI) are broadly rejected as unethical.
- A minority welcomes these cases as potential catalysts to weaken or abolish current copyright; others expect outcomes that favor large corporations over individuals.