2025-01-12

Zuckerberg approved training Llama on LibGen [pdf]

Meta, LibGen, and the LLaMA Lawsuit

Thread centers on court filings showing that Meta leadership approved downloading data from LibGen (a shadow library of pirated books) to train LLaMA.
Some see this as straightforward, large‑scale, willful copyright infringement (downloading and distributing pirated works).
Others argue the key unresolved legal question is whether training on such data (as opposed to outputting it) violates copyright.

Copyright, Fair Use, and Model Training

One side:
- Training on unlicensed, pirated content is no different from any other copyright violation.
- Evidence of torrenting and seeding is particularly damning.
- “Free to use” models still underpin commercial products, so noncommercial rhetoric is irrelevant.

Other side:
- Models are not archives or compressed copies of the training set; weights are tiny relative to input data.
- The real legal issue is reproducing copyrighted text in outputs, not ingesting it.
- Training is likened to a human learning from books, which is not restricted.

Big Tech vs Big Copyright and Power Asymmetry

Many highlight perceived hypocrisy: big tech aggressively enforces its own IP while ignoring others’.
Some expect an eventual narrow “AI training exemption” or compulsory licensing regime that entrenches big players and harms smaller competitors.
Commenters compare this with other platforms (YouTube, Spotify, Reddit, Google Books), where initial piracy or uncompensated use eventually led to negotiated deals.

Shadow Libraries and Access to Knowledge

LibGen and similar sites are praised as de facto global research libraries, especially where paywalls and high per‑article prices block access.
Frustration that individuals have been heavily punished for similar behavior, while corporations quietly exploit the same resources.
Repeated references to past prosecutions over academic journal downloads to highlight “free for me, not for thee.”

Economic and Social Fallout

Concerns about creators’ livelihoods if training on copyrighted works is free and widespread.
Others argue royalties are already negligible in a saturated attention economy; copyright has been eroding since the internet.
Broader anxiety about AI, inequality, and whether responses like UBI or stronger IP enforcement are viable or will just benefit elites.
