AIs can generate near-verbatim copies of novels from training data
Technical capability & memorization
- Some argue it’s unsurprising: if models are next-token predictors, any novel in the training set is just one valid token sequence, so there must exist prompts that elicit it.
- Others counter that predicting an unseen novel verbatim is astronomically unlikely; being able to do so strongly implies the text was in training data.
- Several commenters emphasize that LLMs are lossy compressors, not perfect archives: by the pigeonhole principle, a model with far fewer parameters than training tokens cannot store every string losslessly, so the probability of verbatim output depends on how often and how redundantly a string appeared during training.
- Reported results include long “near-verbatim blocks” of thousands of tokens from famous novels, not just single-sentence completions.
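The "astronomically unlikely" claim above can be made concrete with a back-of-the-envelope sketch. The numbers below are illustrative assumptions, not measurements: even if a model assigned a generous 0.99 average probability to each correct next token of an *unseen* novel, the joint probability of emitting the whole book verbatim collapses to effectively zero.

```python
import math

# Illustrative assumptions (not measured values):
per_token_prob = 0.99    # assumed average probability of the correct next token
novel_tokens = 100_000   # rough token count of a full-length novel

# Joint probability of a 100k-token verbatim continuation, in log10 space
# to avoid floating-point underflow.
log10_joint = novel_tokens * math.log10(per_token_prob)
print(f"joint probability of a verbatim novel ~ 10^{log10_joint:.0f}")
```

The result is on the order of 10^-436, which is why long verbatim runs are taken as strong evidence that the text was memorized from training data rather than predicted.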
Significance of the results
- One camp calls this a “nothingburger”: a 70% sentence-level match with imperfect runs means you would still need the original to reconstruct a clean copy of the book.
- Others think it’s significant legally and evidentially: being able to extract 70%+ or multi‑thousand‑token continuous chunks is likely enough to prove inclusion in training data and to interest litigators.
- There’s interest in whether less-popular, sparsely quoted books can be similarly reproduced; that would be harder to explain away as incidental quotation, and therefore more worrying.
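The two metrics debated above (percent match and longest continuous chunk) can be computed with the standard library. This is a minimal sketch using hypothetical stand-in strings, not text from the paper; the metric, not the content, is the point.

```python
import difflib

# Hypothetical stand-ins for a source passage and a model's near-verbatim output.
original = "it was the best of times it was the worst of times it was the age of wisdom"
generated = "it was the best of times it was the worst of times it was an age of wisdom"

orig_tokens = original.split()
gen_tokens = generated.split()

sm = difflib.SequenceMatcher(a=orig_tokens, b=gen_tokens, autojunk=False)

# Longest contiguous run of identical tokens: the "continuous chunk" measure.
match = sm.find_longest_match(0, len(orig_tokens), 0, len(gen_tokens))
print("longest verbatim run:", match.size, "tokens")

# Overall token-level similarity: roughly the "% match" figure cited in the thread.
print(f"token-level match ratio: {sm.ratio():.0%}")
```

On these stand-in strings, a single swapped word still leaves a 14-token verbatim run and a ~94% match ratio, which illustrates why "imperfect" reproduction can nonetheless look evidentially significant at multi-thousand-token scale.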
Copyright: is training a “copy”?
- Strong view: the violation occurs at copying into the training set, regardless of later output or transformation. If the model weights encode works that can be reproduced, they contain copies.
- Counter-view: models are more akin to humans who read and “learn” from books; the key legal issue should be downstream distribution, not mere internalization.
- Disagreement over fair use: some think training will ultimately be justified as transformative; others think copyright law’s text (e.g., US “copy” definition) clearly covers model weights.
Guardrails, jailbreaks, and liability
- The paper notes that some models require jailbreaking before they will extract text, while others comply with simple continuation prompts.
- Debate over whether needing jailbreaks counts as “circumventing a protection system” or just abusing a weak safety layer.
- Some argue liability should fall on the user who coerces the model into infringement; others say providers are responsible if their product readily enables mass reproduction.
Human vs machine analogies
- Frequent comparisons to humans memorizing books, singing songs, or writing fanfic; critics respond that:
  - Computers are explicitly covered as “machines/devices” in copyright law, unlike human memory.
  - Human-scale memorization/distribution is rare and low‑impact; LLMs scale to millions of perfect or near‑perfect copies.
- General consensus: “humans also do this” is rhetorically appealing but legally weak.
Broader framings
- Some frame LLMs as super‑compressed libraries or search/index systems over the internet, now leaking underlying works.
- Others see this simply as large‑scale, automated plagiarism, enabled by messy, heavily lobbied copyright law that has repeatedly lagged technological shifts.