Authors seek Meta's torrent client logs and seeding data in AI piracy probe

Allegations and BitTorrent Focus

  • Core discussion centers on claims that Meta used BitTorrent/LibGen to obtain large book corpora, and that plaintiffs now want torrent logs to prove seeding (redistribution).
  • Several comments stress that seeding/distribution is legally much more serious than mere downloading or training.
  • Others note that BitTorrent clients can disable uploading, but plaintiffs in a civil case may rely on a presumption that seeding occurred unless Meta can show otherwise (e.g., through client configuration or testimony).
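
To make the torrent-log point concrete: under the BitTorrent tracker protocol (BEP 3), a client's announce request carries self-reported `uploaded`, `downloaded`, and `left` byte counters. The minimal Python sketch below parses one such announce URL and flags whether any upload was reported. The tracker host, the sample URL, and the function name `summarize_announce` are hypothetical; nothing here reflects the actual format of any logs at issue in the case.

```python
from urllib.parse import urlsplit, parse_qs


def summarize_announce(url: str) -> dict:
    """Extract the transfer counters from a BEP 3 style announce URL."""
    query = parse_qs(urlsplit(url).query)

    def as_int(key: str) -> int:
        # Missing counters are treated as 0 (an assumption of this sketch).
        return int(query.get(key, ["0"])[0])

    uploaded = as_int("uploaded")
    return {
        "uploaded": uploaded,
        "downloaded": as_int("downloaded"),
        "left": as_int("left"),
        "event": query.get("event", [""])[0],
        "reported_upload": uploaded > 0,
    }


# Hypothetical announce line, as it might appear in a client or proxy log.
example = (
    "http://tracker.example/announce?info_hash=%12%34"
    "&peer_id=-XX0000-abcdefghijkl&port=6881"
    "&uploaded=0&downloaded=1048576&left=0&event=completed"
)
summary = summarize_announce(example)
print(summary["reported_upload"])
```

Because these counters are reported by the client itself, a zero `uploaded` value is suggestive rather than conclusive, which fits the comments noting that configuration and testimony would also matter.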

Fair Use, Copyright, and LLM Training

  • Ongoing debate over whether using copyrighted texts to train LLMs is “fair use” or infringement, for example by creating a derivative work.
  • Some argue training is like lossy compression or statistical analysis; the model holds abstractions, not full works.
  • Others point to clear memorization/regurgitation examples as evidence of infringement.
  • A distinction is drawn between:
    • copying for training,
    • model generation (possibly derivative),
    • and user actions (who actually uses or republishes the outputs).
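
One rough way to quantify the memorization/regurgitation claims above is to measure the longest run of consecutive words that a model output shares with a source text. The sketch below uses Python's standard `difflib`; the function name, the sample strings, and the whitespace tokenization are illustrative assumptions, and a long shared run is at most a signal of verbatim copying, not a legal test.

```python
from difflib import SequenceMatcher


def longest_shared_run(source: str, output: str) -> tuple[int, str]:
    """Return the length (in words) and text of the longest word run
    that appears verbatim in both source and output."""
    src_words = source.lower().split()
    out_words = output.lower().split()
    # autojunk=False keeps difflib from discarding frequent words on long inputs.
    matcher = SequenceMatcher(None, src_words, out_words, autojunk=False)
    match = matcher.find_longest_match(0, len(src_words), 0, len(out_words))
    phrase = " ".join(src_words[match.a:match.a + match.size])
    return match.size, phrase


# Illustrative strings only; not taken from any model or dataset.
source_text = "it was the best of times it was the worst of times"
model_output = "the model said it was the best of times indeed"
run_length, run_text = longest_shared_run(source_text, model_output)
print(run_length, repr(run_text))
```

A short shared run is expected for any two English texts; only runs far longer than ordinary phrasing would support the regurgitation argument.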

Piracy vs Legal Acquisition of Data

  • One side claims large-scale piracy is practically inevitable because buying/licensing millions of books is “too hard.”
  • Pushback is strong: big tech has enormous resources; they could bulk-license, buy publishers, or negotiate new rights instead of pirating.
  • Some note existing infrastructures like Google Books and ebook stores as potential legal sources.
  • Others argue that copyright doesn’t clearly grant authors a specific “no LLM training” right, so even licensed ebook sales may not resolve the issue.

Technical Approaches and Workarounds

  • Commenters mention synthetic data, summaries, QA pairs, and federated learning as ways to reduce direct exposure to copyrighted text.
  • Concerns about “knowledge collapse” if models are trained only on synthetic data.
  • Some see hybrid organic/synthetic training and filtered corpora as the emerging norm.
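
The hybrid organic/synthetic idea above can be sketched as a sampler that caps the share of synthetic examples in a training batch. This is a minimal illustration, not a description of any real pipeline: the function name `mix_corpus`, the fraction of 0.3, and the toy documents are all assumptions.

```python
import random


def mix_corpus(organic, synthetic, synthetic_fraction, size, seed=0):
    """Draw `size` examples, with a fixed fraction taken from the synthetic pool."""
    if not 0.0 <= synthetic_fraction <= 1.0:
        raise ValueError("synthetic_fraction must be between 0 and 1")
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    n_synthetic = round(size * synthetic_fraction)
    batch = [("synthetic", rng.choice(synthetic)) for _ in range(n_synthetic)]
    batch += [("organic", rng.choice(organic)) for _ in range(size - n_synthetic)]
    rng.shuffle(batch)  # interleave the two sources
    return batch


# Toy pools standing in for licensed text and generated QA pairs.
organic_docs = ["licensed essay", "public-domain novel"]
synthetic_docs = ["generated QA pair", "generated summary"]
batch = mix_corpus(organic_docs, synthetic_docs, synthetic_fraction=0.3, size=10)
synthetic_count = sum(1 for origin, _ in batch if origin == "synthetic")
print(synthetic_count, len(batch))
```

Keeping the synthetic share as an explicit parameter makes it easy to test how far it can be raised before quality degrades, which is the “knowledge collapse” concern raised in the thread.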

Ethical and Societal Themes

  • Strong criticism of “move fast and break things” and perceived hypocrisy: tech firms restricting training on their outputs while freely training on others’ work.
  • Hypotheticals about extreme data collection (e.g., scanning everyone’s brain for AGI) are broadly rejected as unethical.
  • A minority welcomes these cases as potential catalysts to weaken or abolish current copyright; others expect outcomes that favor large corporations over individuals.