Nvidia contacted Anna's Archive to access books

Legal status of training on copyrighted books

  • Several comments debate whether using pirated books for AI training can be defended as “fair use” when the model only keeps “statistical correlations.”
  • Some argue this fits existing precedents (e.g., book scanning and search), since the models aren’t meant to redistribute the original texts, only extract patterns.
  • Others counter that even scanning/downloading is already reproduction, and obtaining works from pirate libraries is illegal regardless of what you do afterward.
  • There’s disagreement over whether the legal problem lies at the input stage (copying works) or the output stage (reproducing copyrighted text).

Human reading vs machine training

  • A recurring analogy compares AI training to a person reading and remembering books.
  • One side says calling training illegal is like criminalizing human memory, and that the law doesn’t distinguish by scale (one book vs millions).
  • Critics reply that the law often does treat scale as a proxy for intent (e.g., in drug cases), and that ingesting “every single piece of produced content” is categorically different.
  • Many note a key distinction: humans can’t reliably reproduce long works verbatim, whereas models can be induced to regurgitate large copyrighted passages, as shown in cited research.

Scale, intent, and ambiguity of copyright

  • Commenters note copyright law is underspecified for LLMs; outcomes are “unclear” and highly dependent on future cases.
  • Some argue models were “intended” to produce legal, transformative output and that infringing outputs are side effects; others say corporations routinely accept legal risk for profit.

Source of data: piracy vs legal channels

  • Strong criticism centers on a trillion‑dollar company using pirate libraries instead of paying publishers or authors.
  • Others stress practical incentives: there is no ready-made, licensed corpus product; negotiating with every publisher is complex and costly, while Anna’s Archive offers a single 500 TB firehose.
  • Some point to alternative legal paths (buying and scanning physical books, then destroying them), but acknowledge this is expensive and politically fraught.

Power imbalance and broader implications

  • Multiple comments highlight that “laws are for the poor”: individuals are punished for piracy while large firms commit mass infringement with minimal consequences.
  • There’s resentment that AI systems trained on these books now automate or devalue the work of the very authors whose texts were copied.
  • The episode is seen as evidence of how desperate AI companies are for high‑quality data, contradicting narratives that synthetic data will soon suffice.