AIs can generate near-verbatim copies of novels from training data

Technical capability & memorization

  • Some argue it’s unsurprising: if models are next-token predictors, any novel in the training set is just one valid token sequence, so there must exist prompts that elicit it.
  • Others counter that predicting an unseen novel verbatim is astronomically unlikely; being able to do so strongly implies the text was in training data.
  • Several commenters emphasize that LLMs are lossy compressors (pigeonhole principle), not perfect archives; the probability of verbatim output depends on how often and how redundantly a string appeared during training.
  • Reported results include long “near-verbatim blocks” of thousands of tokens from famous novels, not just single-sentence completions.
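The "next-token predictors can elicit training text" point can be illustrated with a toy that is nothing like an LLM but shares the failure mode: a character-level n-gram model that has fully memorized its one-sentence training corpus. Greedy decoding from a short prompt then reproduces the rest verbatim. The corpus, the context length of 8, and the function names are all illustrative choices, not anything from the paper under discussion.

```python
from collections import Counter, defaultdict

def train(text, order=8):
    """Count next-character frequencies for every `order`-length context."""
    model = defaultdict(Counter)
    for i in range(len(text) - order):
        model[text[i:i + order]][text[i + order]] += 1
    return model

def greedy_continue(model, prompt, n, order=8):
    """Extend `prompt` by repeatedly picking the most likely next character."""
    out = prompt
    for _ in range(n):
        ctx = out[-order:]
        if ctx not in model:
            break
        out += model[ctx].most_common(1)[0][0]
    return out

corpus = ("Call me Ishmael. Some years ago, never mind how long precisely, "
          "I went to sea.")
model = train(corpus)
# Prompting with the first 8 characters regenerates the sentence verbatim.
print(greedy_continue(model, corpus[:8], len(corpus)))
```

Because every 8-character context in this tiny corpus is unique, the "model" is a perfect memorizer; the commenters' point is that at LLM scale the same effect appears probabilistically for strings seen often enough in training.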

Significance of the results

  • One camp calls this a “nothingburger”: a 70% sentence-level match with imperfect runs means you would still need the original to reconstruct a clean book.
  • Others think it’s significant legally and evidentially: being able to extract 70%+ or multi‑thousand‑token continuous chunks is likely enough to prove inclusion in training data and to interest litigators.
  • There’s interest in whether less-popular, sparsely quoted books can be similarly reproduced; that would be more worrying.
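The thread does not say exactly how the paper computes its "70% sentence-level match" figure, so the following is only a plausible sketch of such a metric under stated assumptions: split both texts into sentences, normalize whitespace, and report the fraction of the original's sentences that appear verbatim in the model's output. The splitter and the example strings are hypothetical.

```python
import re

def sentences(text):
    """Naive sentence splitter: break after ., !, or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def sentence_match_rate(original, generated):
    """Fraction of original sentences reproduced verbatim in the generation,
    after collapsing runs of whitespace."""
    gen = {" ".join(s.split()) for s in sentences(generated)}
    orig = [" ".join(s.split()) for s in sentences(original)]
    if not orig:
        return 0.0
    return sum(s in gen for s in orig) / len(orig)

original = "The clock struck one. The mouse ran down. All fell silent."
generated = "The clock struck one. The mouse ran up. All fell silent."
print(sentence_match_rate(original, generated))  # 2 of 3 sentences match → ~0.67
```

Under a definition like this, a 70% score means most sentences survive exactly while the rest are paraphrased or corrupted, which is consistent with both camps' readings: not a clean copy, but far beyond chance.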

Copyright: is training a “copy”?

  • Strong view: the violation occurs at copying into the training set, regardless of later output or transformation. If the model weights encode works that can be reproduced, they contain copies.
  • Counter-view: models are more akin to humans who read and “learn” from books; the key legal issue should be downstream distribution, not mere internalization.
  • Disagreement over fair use: some think training will ultimately be justified as transformative; others think copyright law’s text (e.g., US “copy” definition) clearly covers model weights.

Guardrails, jailbreaks, and liability

  • The paper notes that some models require jailbreaking to extract text, while others comply with simple continuation prompts.
  • Debate over whether needing jailbreaks counts as “circumventing a protection system” or just abusing a weak safety layer.
  • Some argue liability should fall on the user who coerces the model into infringement; others say providers are responsible if their product readily enables mass reproduction.

Human vs machine analogies

  • Frequent comparisons to humans memorizing books, singing songs, or writing fanfic; critics respond that:
    • Computers are explicitly covered as “machines/devices” in copyright law, unlike human memory.
    • Human-scale memorization/distribution is rare and low‑impact; LLMs scale to millions of perfect or near‑perfect copies.
  • General consensus: “humans also do this” is rhetorically appealing but legally weak.

Broader framings

  • Some frame LLMs as super‑compressed libraries or search/index systems over the internet, now leaking underlying works.
  • Others see this simply as large‑scale, automated plagiarism, enabled by messy, heavily lobbied copyright law that has repeatedly lagged technological shifts.