AIs can generate near-verbatim copies of novels from training data
Technical capability & memorization
- Some argue it’s unsurprising: if models are next-token predictors, any novel in the training set is just one valid token sequence, so there must exist prompts that elicit it.
- Others counter that predicting an unseen novel verbatim is astronomically unlikely; being able to do so strongly implies the text was in training data.
- Several commenters emphasize that LLMs are lossy compressors, not perfect archives: by the pigeonhole principle, a model with far fewer parameters than training tokens cannot store every string losslessly, so the probability of verbatim output depends on how often and how redundantly a string appeared during training.
- Reported results include long “near-verbatim blocks” of thousands of tokens from famous novels, not just single-sentence completions.
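The "astronomically unlikely" claim above can be made concrete with a back-of-the-envelope sketch. The numbers below are illustrative assumptions, not measurements: even if a model assigned a generous 0.99 average probability to each correct next token of an *unseen* novel, the joint probability of emitting the whole book verbatim collapses to effectively zero.

```python
import math

# Illustrative assumptions (not measured values):
per_token_prob = 0.99    # assumed average probability of the correct next token
novel_tokens = 100_000   # rough token count of a full-length novel

# Joint probability of a 100k-token verbatim continuation, in log10 space
# to avoid floating-point underflow.
log10_joint = novel_tokens * math.log10(per_token_prob)
print(f"joint probability of a verbatim novel ~ 10^{log10_joint:.0f}")
```

The result is on the order of 10^-436, which is why long verbatim runs are taken as strong evidence that the text was memorized from training data rather than predicted.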
Significance of the results
- One camp calls this a “nothingburger”: a 70% sentence-level match with imperfect runs means you would still need the original to reconstruct a clean copy of the book.
- Others think it’s significant legally and evidentially: being able to extract 70%+ or multi‑thousand‑token continuous chunks is likely enough to prove inclusion in training data and to interest litigators.
- There’s interest in whether less-popular, sparsely quoted books can be similarly reproduced; that would be harder to explain away as incidental quotation, and therefore more worrying.
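The two metrics debated above (percent match and longest continuous chunk) can be computed with the standard library. This is a minimal sketch using hypothetical stand-in strings, not text from the paper; the metric, not the content, is the point.

```python
import difflib

# Hypothetical stand-ins for a source passage and a model's near-verbatim output.
original = "it was the best of times it was the worst of times it was the age of wisdom"
generated = "it was the best of times it was the worst of times it was an age of wisdom"

orig_tokens = original.split()
gen_tokens = generated.split()

sm = difflib.SequenceMatcher(a=orig_tokens, b=gen_tokens, autojunk=False)

# Longest contiguous run of identical tokens: the "continuous chunk" measure.
match = sm.find_longest_match(0, len(orig_tokens), 0, len(gen_tokens))
print("longest verbatim run:", match.size, "tokens")

# Overall token-level similarity: roughly the "% match" figure cited in the thread.
print(f"token-level match ratio: {sm.ratio():.0%}")
```

On these stand-in strings, a single swapped word still leaves a 14-token verbatim run and a ~94% match ratio, which illustrates why "imperfect" reproduction can nonetheless look evidentially significant at multi-thousand-token scale.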
Copyright: is training a “copy”?
- Strong view: the violation occurs at copying into the training set, regardless of later output or transformation. If the model weights encode works that can be reproduced, they contain copies.
- Counter-view: models are more akin to humans who read and “learn” from books; the key legal issue should be downstream distribution, not mere internalization.
- Disagreement over fair use: some think training will ultimately be justified as transformative; others think copyright law’s text (e.g., US “copy” definition) clearly covers model weights.
Guardrails, jailbreaks, and liability
- The paper notes that some models require jailbreaking before they will extract text, while others comply with simple continuation prompts.
- Debate over whether needing jailbreaks counts as “circumventing a protection system” or just abusing a weak safety layer.
- Some argue liability should fall on the user who coerces the model into infringement; others say providers are responsible if their product readily enables mass reproduction.
Human vs machine analogies
- Frequent comparisons to humans memorizing books, singing songs, or writing fanfic; critics respond that:
  - Computers are explicitly covered as “machines/devices” in copyright law, unlike human memory.
  - Human-scale memorization/distribution is rare and low‑impact; LLMs scale to millions of perfect or near‑perfect copies.
- General consensus: “humans also do this” is rhetorically appealing but legally weak.
Broader framings
- Some frame LLMs as super‑compressed libraries or search/index systems over the internet, now leaking underlying works.
- Others see this simply as large‑scale, automated plagiarism, enabled by messy, heavily lobbied copyright law that has repeatedly lagged technological shifts.