Authors seek Meta's torrent client logs and seeding data in AI piracy probe
Allegations and BitTorrent Focus
- Core discussion centers on claims that Meta used BitTorrent/LibGen to obtain large book corpora, and that plaintiffs now want torrent logs to prove seeding (redistribution).
- Several comments stress that seeding/distribution is legally much more serious than mere downloading or training.
- Others note that BitTorrent clients can be configured not to upload, but plaintiffs in a civil case may benefit from evidentiary presumptions unless Meta can show otherwise (e.g., client configuration, testimony); a client-side sketch follows below.
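As a purely illustrative sketch of what "download without seeding" can look like, here is a minimal loop assuming the libtorrent Python bindings: cap the upload rate and remove the torrent the instant it completes. The torrent filename and paths are placeholders, and nothing here reflects Meta's actual configuration.

```python
import time

import libtorrent as lt  # assumption: libtorrent's Python bindings are installed

ses = lt.session()
# Throttle outbound payload to ~1 byte/s. In libtorrent a rate limit of 0
# means *unlimited*, so a tiny positive cap is used instead.
ses.apply_settings({"upload_rate_limit": 1})

# "example.torrent" and "./downloads" are hypothetical placeholders.
info = lt.torrent_info("example.torrent")
handle = ses.add_torrent({"ti": info, "save_path": "./downloads"})

while True:
    status = handle.status()
    if status.is_seeding:            # download finished; seeding would start now
        ses.remove_torrent(handle)   # drop the torrent before it serves pieces
        break
    time.sleep(1)
```

Even so, standard clients trade small amounts of piece data with peers while downloading, so "zero upload" is an approximation; whether any such configuration was in place is exactly what the requested logs and settings would show.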
Fair Use, Copyright, and LLM Training
- Ongoing debate over whether using copyrighted texts to train LLMs is "fair use" or an infringement/derivative work.
- Some argue training is like lossy compression or statistical analysis; the model holds abstractions, not full works (the analogy is made precise after this list).
- Others point to clear memorization/regurgitation examples as evidence of infringement (a simple extraction test is sketched after this list).
- A distinction is drawn between:
  - copying works for training,
  - model generation (possibly derivative), and
  - user actions (who actually uses or republishes outputs).
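On the lossy-compression analogy: standard information theory (background, not a claim from the thread) makes it concrete. An arithmetic coder driven by a model q can encode a text x in about -log2 q(x) bits, so the usual cross-entropy training objective is literally the expected compressed size of the corpus:

```latex
\[
  \mathcal{L}(q) \;=\; -\,\mathbb{E}_{x \sim p_{\text{data}}}\bigl[\log_2 q(x)\bigr]
  \qquad \text{(training loss = expected code length, in bits)}
\]
```

This is the sense in which some commenters treat the weights as a lossy statistical summary of the corpus rather than a copy of it.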
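On the memorization side, here is a minimal sketch of the kind of extraction test such examples rest on, assuming the Hugging Face transformers library; the model name and passage are stand-ins, not findings from the case:

```python
import difflib

from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-ins: a real test would use the model under scrutiny and a passage
# from an allegedly ingested book.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prefix = "It is a truth universally acknowledged, that a single man"
truth = " in possession of a good fortune, must be in want of a wife."

ids = tok(prefix, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=20, do_sample=False)  # greedy decoding
continuation = tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True)

# Near-verbatim reproduction of the true continuation suggests memorization,
# not mere abstraction.
score = difflib.SequenceMatcher(None, continuation, truth).ratio()
print(f"{continuation!r}\nsimilarity to source text: {score:.2f}")
```

Verbatim continuation under greedy decoding is the usual smoking gun in published regurgitation demos; paraphrased output is far harder to classify.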
Piracy vs Legal Acquisition of Data
- One side claims large-scale piracy is practically inevitable because buying/licensing millions of books is “too hard.”
- Pushback is strong: big tech firms have enormous resources and could bulk-license, buy publishers, or negotiate new rights instead of pirating.
- Some point to existing infrastructure such as Google Books and ebook stores as potential legal sources.
- Others argue that copyright doesn’t clearly grant authors a specific “no LLM training” right, so even licensed ebook sales may not resolve the issue.
Technical Approaches and Workarounds
- Mention of synthetic data, summaries, QA pairs, and federated learning as ways to reduce direct exposure to copyrighted text (a federated-averaging sketch follows this list).
- Concerns about "knowledge collapse" if models are trained only on synthetic data (see the toy simulation after this list).
- Some see hybrid organic/synthetic training and filtered corpora as the emerging norm.
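The federated-learning idea mentioned above, sketched minimally in numpy: hypothetical clients (e.g., rights holders) train locally on private data and share only model weights, which a server averages (plain FedAvg). Everything here is a toy stand-in:

```python
import numpy as np

rng = np.random.default_rng(0)

def local_step(w, X, y, lr=0.1, epochs=5):
    """One client's local training: a few gradient steps on private data."""
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)  # least-squares gradient
        w = w - lr * grad
    return w

# Hypothetical clients; each keeps its (X, y) data private.
true_w = np.array([1.0, -2.0, 0.5])
clients = []
for _ in range(4):
    X = rng.normal(size=(50, 3))
    y = X @ true_w + 0.1 * rng.normal(size=50)
    clients.append((X, y))

w = np.zeros(3)
for _ in range(20):  # communication rounds
    updates = [local_step(w.copy(), X, y) for X, y in clients]
    w = np.mean(updates, axis=0)  # FedAvg: the server only averages weights

print(w)  # converges near true_w without raw data ever leaving a client
```

The design point is that only weights cross the trust boundary, which is why commenters raise it as a way to train on text nobody is willing to hand over wholesale.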
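The "knowledge collapse" concern is a cousin of the well-documented model-collapse effect, which a toy recursion illustrates: fit a distribution, sample from the fit, refit on the samples. With small per-generation samples, the fitted spread tends to drift toward zero (exact numbers depend on the seed):

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0 trains on "organic" data from the true distribution N(0, 1).
data = rng.normal(0.0, 1.0, size=20)

for gen in range(1, 101):
    mu, sigma = data.mean(), data.std()
    # Each later generation trains only on the previous model's synthetic output.
    data = rng.normal(mu, sigma, size=20)
    if gen % 20 == 0:
        print(f"gen {gen:3d}: fitted sigma = {sigma:.3f}")  # spread tends to shrink
```

Hybrid organic/synthetic training, as noted above, is in part an attempt to re-anchor each generation in real data so this drift does not compound.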
Ethical and Societal Themes
- Strong criticism of “move fast and break things” and perceived hypocrisy: tech firms restricting training on their outputs while freely training on others’ work.
- Hypotheticals about extreme data collection (e.g., scanning everyone’s brain for AGI) are broadly rejected as unethical.
- A minority welcomes these cases as potential catalysts to weaken or abolish current copyright; others expect outcomes that favor large corporations over individuals.