Nvidia contacted Anna’s Archive to access books
Legal status of training on copyrighted books
- Several comments debate whether using pirated books for AI training can be defended as “fair use” when the model only keeps “statistical correlations.”
- Some argue this fits existing precedents (e.g., book scanning and search), since the models aren’t meant to redistribute the original texts, only extract patterns.
- Others counter that even scanning/downloading is already reproduction, and obtaining works from pirate libraries is illegal regardless of what you do afterward.
- There’s disagreement over whether the legal problem lies at the input stage (copying works) or the output stage (reproducing copyrighted text).
Human reading vs machine training
- A recurring analogy compares AI training to a person reading and remembering books.
- One side says calling training illegal is like criminalizing human memory, and that the law doesn’t distinguish by scale (one book vs millions).
- Critics reply that the law often does treat scale as a proxy for intent (e.g., quantity in drug offenses), and that slurping “every single piece of produced content” is categorically different from one person reading.
- Many note a key distinction: humans can’t reliably reproduce long works verbatim, whereas models can be induced to regurgitate long copyrighted passages, as research cited in the thread shows.
Scale, intent, and ambiguity of copyright
- Commenters note copyright law is underspecified for LLMs; outcomes are “unclear” and highly dependent on future cases.
- Some argue models were “intended” to produce legal, transformative output and that infringing outputs are side effects; others say corporations routinely accept legal risk for profit.
Source of data: piracy vs legal channels
- Strong criticism centers on a trillion‑dollar company using pirate libraries instead of paying publishers or authors.
- Others stress practical incentives: there is no ready-made, licensed corpus product; negotiating with every publisher is complex and costly, while Anna’s Archive offers a single 500 TB firehose.
- Some point to alternative legal paths (buying and scanning physical books, then destroying them), but acknowledge this is expensive and politically fraught.
Power imbalance and broader implications
- Multiple comments highlight that “laws are for the poor”: individuals are punished for piracy while large firms commit mass infringement with minimal consequences.
- There’s resentment that AI systems trained on these books now automate or devalue the work of the very authors whose texts were copied.
- The episode is seen as evidence of how desperate AI companies are for high‑quality data, contradicting narratives that synthetic data will soon suffice.