2026-01-19

Nvidia contacted Anna's Archive to access books

Legal status of training on copyrighted books

Several comments debate whether using pirated books for AI training can be defended as “fair use” when the model only keeps “statistical correlations.”
Some argue this fits existing precedents (e.g., book scanning and search), since the models aren’t meant to redistribute the original texts, only extract patterns.
Others counter that even scanning/downloading is already reproduction, and obtaining works from pirate libraries is illegal regardless of what you do afterward.
There’s disagreement over whether the legal problem lies at the input stage (copying works) or the output stage (reproducing copyrighted text).

Human reading vs machine training

A recurring analogy compares AI training to a person reading and remembering books.
One side says calling training illegal is like criminalizing human memory, and that the law doesn’t distinguish by scale (one book vs millions).
Critics reply that the law often does treat scale as a proxy for intent (e.g., drug quantities), and that slurping “every single piece of produced content” is categorically different.
Many note a key distinction: humans can’t reliably reproduce long works verbatim, whereas models can be induced to regurgitate large copyrighted passages, as research cited in the thread shows.

Scale, intent, and ambiguity of copyright

Commenters note copyright law is underspecified for LLMs; outcomes are “unclear” and highly dependent on future cases.
Some argue models were “intended” to produce legal, transformative output and that infringing outputs are side effects; others say corporations routinely accept legal risk for profit.

Source of data: piracy vs legal channels

Strong criticism centers on a trillion‑dollar company using pirate libraries instead of paying publishers or authors.
Others stress practical incentives: there is no ready-made, licensed corpus product; negotiating with every publisher is complex and costly, while Anna’s Archive offers a single 500 TB firehose.
Some point to alternative legal paths (buying and scanning physical books, then destroying them), but acknowledge this is expensive and politically fraught.

Power imbalance and broader implications

Multiple comments highlight that “laws are for the poor”: individuals are punished for piracy while large firms commit mass infringement with minimal consequences.
There’s resentment that AI systems trained on these books now automate or devalue the work of the very authors whose texts were copied.
The episode is seen as evidence of how desperate AI companies are for high‑quality data, contradicting narratives that synthetic data will soon suffice.
