2025-06-16

Extracting memorized pieces of books from open-weight language models

Scope of Infringement: Training vs Input vs Output

One camp argues any unlicensed use of a book in training is already infringement (like feeding a pirated ebook into scripts), regardless of what the model can reproduce.
Others distinguish:
- (a) illegal acquisition (e.g., torrenting “The Pile”),
- (b) training as internal processing (argued to be non‑infringing, akin to reading/learning),
- (c) user‑facing outputs that might violate copyright.
Some say the only legally clean route is explicit licensing of works for training.

LLMs, Humans, and Tools

Commenters offer many analogies: humans memorizing poems, artists drawing Mickey Mouse, brains “encoding” movies.
Counterpoint: the law already treats humans and machines differently (e.g., purely AI-generated art is not copyrightable in the US), so human analogies may be misleading.
Others compare LLMs to photocopiers, search engines, JPEG compression, panoramic photos with copyrighted objects, or fuzzy text databases.

Memorization, Compression, and the Paper’s Results

Commenters note that models can’t “store” full training corpora (weights << corpus), so memorization is sparse and concentrated in repeated or popular texts.
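The “weights << corpus” point can be checked with back-of-envelope arithmetic. The sketch below is illustrative only: the parameter count, token count, and bytes-per-token figures are assumptions chosen for the example, not numbers taken from the paper or the thread.

```python
# Back-of-envelope size comparison. Every figure here is an
# illustrative assumption, not a number from the paper or the thread.

def weights_to_corpus_ratio(n_params, bytes_per_param, n_tokens, bytes_per_token):
    """Return (size of weights in bytes) / (size of training text in bytes)."""
    weight_bytes = n_params * bytes_per_param
    corpus_bytes = n_tokens * bytes_per_token
    return weight_bytes / corpus_bytes

# Hypothetical model: 70 billion parameters stored at 2 bytes each,
# trained on 15 trillion tokens at roughly 4 bytes of text per token.
ratio = weights_to_corpus_ratio(70e9, 2, 15e12, 4)
print(f"weights are about {ratio:.2%} of the corpus size")
```

Under these assumed figures the weights come to well under 1% of the raw text, which is why verbatim storage of the whole corpus is not possible and any memorization has to be selective.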
That Harry Potter and 1984 are “almost entirely” recoverable may reflect extreme repetition and widespread online quotation rather than ingestion of a single copy.
The paper notes that extraction is probabilistic and expensive (hundreds or thousands of prompts), so deliberate verbatim extraction is impractical in normal use.
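To see why probabilistic extraction can take hundreds or thousands of prompts, here is a minimal sketch, not the paper's code. It assumes each sampled continuation is an independent draw and that the per-token probabilities of the target passage are known; the function names and the 0.9 per-token figure are hypothetical.

```python
import math

def sequence_probability(token_probs):
    """Probability of sampling an exact passage: the product of its
    per-token conditional probabilities (summed in log space)."""
    return math.exp(sum(math.log(p) for p in token_probs))

def prob_at_least_one(p, n):
    """Chance that at least one of n independent samples reproduces the passage."""
    return 1.0 - (1.0 - p) ** n

def samples_needed(p, target):
    """Smallest n for which prob_at_least_one(p, n) reaches the target."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))

# Hypothetical passage: 50 tokens, each assigned probability 0.9 by the model.
p = sequence_probability([0.9] * 50)
print(f"per-sample probability: {p:.5f}")
print(f"expected samples until first hit: {1.0 / p:.0f}")
print(f"samples for a 95% chance of a hit: {samples_needed(p, 0.95)}")
```

Even when the model is fairly confident about every token, the product over a 50-token passage is small, so a single prompt rarely returns the passage verbatim and the cost grows with the number of passages one tries to recover.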
Some see LLMs as “lossy compressed, queryable databases of training data”; others emphasize their generative/transformative aspects.

Liability and Responsibility

Disagreement over who is liable when infringing text appears:
- user (like someone misusing a tool),
- model provider (who ingested the copyrighted works),
- or both (analogy to Napster and secondary liability).
Debate over whether safety filters and treating verbatim reproduction as a “failure state” meaningfully change the legal picture.

Fair Use, Transformative Use, and Market Harm

References to Google Books and transformative fair use as a possible shield, especially in the US; others note jurisdictions without such doctrines.
Arguments that LLMs don’t practically substitute for entire books vs claims that they are marketed as partial replacements (code, genre fiction, images).
Some fear that if copyright maximalists “win,” fair use will shrink in ways that harm non‑AI art and creativity.

Power, Law, and Likely Outcomes

Several comments emphasize that large tech firms’ deep pockets and geopolitical narratives (“China will beat us”) may shape eventual doctrine more than clean legal theory.
Others predict “death by a thousand paper cuts” from many small infringement suits once verbatim or substantially similar outputs can be demonstrated.
