Extracting memorized pieces of books from open-weight language models
Scope of Infringement: Training vs Input vs Output
- One camp argues any unlicensed use of a book in training is already infringement (like feeding a pirated ebook into scripts), regardless of what the model can reproduce.
- Others distinguish three separate questions:
  - (a) illegal acquisition (e.g., torrenting “The Pile”),
  - (b) training as internal processing (argued to be non‑infringing, akin to reading/learning),
  - (c) user‑facing outputs that might violate copyright.
- Some say the only legally clean route is explicit licensing of works for training.
LLMs, Humans, and Tools
- Many analogies to human behavior are offered: humans memorizing poems, artists drawing Mickey Mouse, brains “encoding” movies.
- Counterpoint: the law already treats humans and machines differently (e.g., AI art is not copyrightable), so human analogies may be misleading.
- Others compare LLMs to photocopiers, search engines, JPEG compression, panoramic photos with copyrighted objects, or fuzzy text databases.
Memorization, Compression, and the Paper’s Results
- Commenters note that models cannot “store” their full training corpora (the weights are far smaller than the corpus), so memorization is sparse and concentrated in repeated or popular texts.
- That Harry Potter and 1984 are “almost entirely” recoverable may reflect extreme repetition and many online quotes rather than ingestion of a single copy.
- The paper notes that extraction is probabilistic and expensive (hundreds or thousands of prompts), so deliberate verbatim extraction is impractical in normal use.
- Some see LLMs as “lossy compressed, queryable databases of training data”; others emphasize their generative/transformative aspects.
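To make the “probabilistic and expensive” point concrete, here is a minimal sketch of the arithmetic, not the paper's actual procedure. It assumes a verbatim continuation is emitted only if every token is sampled correctly, and that repeated attempts are independent. The function names and the example numbers are hypothetical.

```python
import math


def sequence_probability(token_logprobs):
    """Probability of sampling the whole continuation verbatim.

    Assumes the sequence probability is the product of per-token
    probabilities, i.e. exp of the summed log-probabilities.
    """
    return math.exp(sum(token_logprobs))


def samples_needed(p_seq, target=0.5):
    """Smallest n with 1 - (1 - p_seq)**n >= target.

    Assumes each attempt is an independent draw that succeeds
    with probability p_seq (an illustrative simplification).
    """
    if not 0.0 < p_seq <= 1.0:
        raise ValueError("p_seq must be in (0, 1]")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be in (0, 1)")
    if p_seq == 1.0:
        return 1
    # log1p(-p) keeps precision when p_seq is very small.
    return math.ceil(math.log(1.0 - target) / math.log1p(-p_seq))


if __name__ == "__main__":
    # Hypothetical case: a 50-token passage where the model assigns
    # probability 0.9 to each correct token.
    logprobs = [math.log(0.9)] * 50
    p = sequence_probability(logprobs)
    print(f"per-attempt probability: {p:.4f}")
    print(f"attempts for a 50% chance: {samples_needed(p, 0.5)}")
```

Even with high per-token confidence, the per-attempt probability shrinks geometrically with passage length. That is consistent with commenters' view that only heavily repeated texts are recoverable at practical cost.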
Liability and Responsibility
- Disagreement over who is liable when infringing text appears:
  - the user (like someone misusing a tool),
  - the model provider (who ingested the copyrighted works),
  - or both (by analogy to Napster and secondary liability).
- Debate over whether safety filters, and treating verbatim reproduction as a “failure state,” meaningfully change the legal picture.
Fair Use, Transformative Use, and Market Harm
- References to Google Books and transformative fair use as a possible shield, especially in the US; others note jurisdictions without such doctrines.
- Arguments that LLMs don’t practically substitute for entire books are set against claims that they are marketed as partial replacements (code, genre fiction, images).
- Some fear that if copyright maximalists “win,” fair use will shrink in ways that harm non‑AI art and creativity.
Power, Law, and Likely Outcomes
- Several comments emphasize that large tech firms’ deep pockets and geopolitical narratives (“China will beat us”) may shape eventual doctrine more than clean legal theory.
- Others predict “death by a thousand paper cuts” from many small infringement suits once verbatim or substantially similar outputs can be demonstrated.