Extracting memorized pieces of books from open-weight language models

Scope of Infringement: Training vs Input vs Output

  • One camp argues any unlicensed use of a book in training is already infringement (like feeding a pirated ebook into scripts), regardless of what the model can reproduce.
  • Others distinguish:
    • (a) illegal acquisition (e.g., torrenting “The Pile”),
    • (b) training as internal processing (argued to be non‑infringing, akin to reading/learning),
    • (c) user‑facing outputs that might violate copyright.
  • Some say the only legally clean route is explicit licensing of works for training.

LLMs, Humans, and Tools

  • Many analogies: humans memorizing poems, artists drawing Mickey Mouse, brains “encoding” movies.
  • Counterpoint: the law already treats humans and machines differently (e.g., purely AI-generated art is not copyrightable in the US), so human analogies may be misleading.
  • Others compare LLMs to photocopiers, search engines, JPEG compression, panoramic photos with copyrighted objects, or fuzzy text databases.

Memorization, Compression, and the Paper’s Results

  • Commenters note that models can’t “store” full training corpora (the weights are far smaller than the corpus), so memorization is sparse and concentrated in repeated or popular texts.
  • That Harry Potter and 1984 are “almost entirely” recoverable may reflect extreme repetition and many online quotes rather than ingestion of a single copy.
  • The paper notes that extraction is probabilistic and expensive (hundreds or thousands of prompts), so verbatim extraction takes deliberate effort and is unlikely in normal use.
  • Some see LLMs as “lossy compressed, queryable databases of training data”; others emphasize their generative/transformative aspects.
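
The two quantitative points above (weights are much smaller than the corpus, and extraction is probabilistic) can be illustrated with a rough back-of-envelope sketch. This is not the paper's method: the function names, the per-token probability of 0.9, and the model and corpus sizes are all hypothetical values chosen for illustration. It assumes each sampling attempt is independent, so the number of attempts until an exact match follows a geometric distribution.

```python
import math


def verbatim_probability(token_logprobs):
    """Probability that one sample reproduces the target passage exactly.

    Under autoregressive sampling this is the product of the per-token
    probabilities, computed here as exp(sum of log-probabilities).
    """
    return math.exp(sum(token_logprobs))


def expected_attempts(p):
    """Expected number of independent samples until the first exact match.

    Mean of a geometric distribution with success probability p.
    """
    if p <= 0.0:
        return math.inf
    return 1.0 / p


def weights_to_corpus_ratio(n_params, bytes_per_param, n_tokens, bytes_per_token):
    """Size of the weights divided by the size of the training text."""
    return (n_params * bytes_per_param) / (n_tokens * bytes_per_token)


if __name__ == "__main__":
    # Hypothetical: a 50-token passage where the model assigns
    # probability 0.9 to each correct next token.
    logprobs = [math.log(0.9)] * 50
    p = verbatim_probability(logprobs)
    print(f"P(exact match in one sample) = {p:.5f}")
    print(f"Expected attempts            = {expected_attempts(p):.0f}")

    # Hypothetical sizes: 70e9 parameters at 2 bytes each,
    # 15e12 training tokens at roughly 4 bytes of text each.
    ratio = weights_to_corpus_ratio(70e9, 2, 15e12, 4)
    print(f"Weights / corpus size        = {ratio:.4f}")
```

Even with a high per-token probability, the chance of an exact match shrinks geometrically with passage length, which is consistent with the thread's point that long verbatim passages are recoverable mainly for heavily repeated texts.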

Liability and Responsibility

  • Disagreement over who is liable when infringing text appears:
    • user (like someone misusing a tool),
    • model provider (who ingested the copyrighted works),
    • or both (analogy to Napster and secondary liability).
  • Debate over whether safety filters, and treating verbatim reproduction as a “failure state,” meaningfully change the legal picture.

Fair Use, Transformative Use, and Market Harm

  • References to Google Books and transformative fair use as a possible shield, especially in the US; others note jurisdictions without such doctrines.
  • Arguments that LLMs don’t practically substitute for entire books vs claims that they are marketed as partial replacements (code, genre fiction, images).
  • Some fear that if copyright maximalists “win,” fair use will shrink in ways that harm non‑AI art and creativity.

Power, Law, and Likely Outcomes

  • Several comments emphasize that large tech firms’ deep pockets and geopolitical narratives (“China will beat us”) may shape eventual doctrine more than clean legal theory.
  • Others predict “death by a thousand paper cuts” from many small infringement suits once verbatim or substantially similar outputs can be demonstrated.