Meta's Llama 3.1 can recall 42 percent of the first Harry Potter book
Measurement and what “42%” actually means
- Commenters stress the paper’s claim is often misread: Llama 3.1 doesn’t output 42% of the book on command.
- Method: the book is divided into overlapping 100‑token windows; for each window the model is given the first 50 tokens and checked for whether it assigns ≥50% probability to the exact next 50 tokens. Sliding this window across the book, 42% of positions meet the criterion (see the sketch after this list).
- That’s closer to “it can often guess the next sentence from the previous one” than “it can recite half the book.” Many commenters therefore call the headline misleading, or the result a “nothing‑burger,” in that sense.
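
To make the criterion concrete, here is a minimal sketch of the measurement as described above, written against Hugging Face transformers. The model name, window stride, and file path are placeholders, not details taken from the paper:

```python
# Sketch of the extraction criterion described above: for each 100-token
# window, does the model assign >= 50% probability to the exact second half
# given the first half? Model name, stride, and file path are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B"  # assumption: any locally runnable checkpoint
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16).eval()

def suffix_probability(prefix_ids: torch.Tensor, suffix_ids: torch.Tensor) -> float:
    """P(suffix | prefix): the product of the model's per-token probabilities
    for the ground-truth suffix, computed in one teacher-forced forward pass."""
    input_ids = torch.cat([prefix_ids, suffix_ids]).unsqueeze(0)
    with torch.no_grad():
        logits = model(input_ids).logits[0]
    # logits[i] is the distribution over token i+1, so the suffix is
    # predicted by positions len(prefix)-1 .. len(prefix)+len(suffix)-2.
    start = prefix_ids.shape[0] - 1
    log_probs = torch.log_softmax(logits[start:start + suffix_ids.shape[0]], dim=-1)
    token_lp = log_probs.gather(1, suffix_ids.unsqueeze(1)).squeeze(1)
    return token_lp.sum().exp().item()

book_ids = tok(open("book.txt").read(), return_tensors="pt").input_ids[0]
windows = range(0, book_ids.shape[0] - 100, 10)  # stride 10 is an assumption
hits = sum(
    suffix_probability(book_ids[i:i + 50], book_ids[i + 50:i + 100]) >= 0.5
    for i in windows
)
print(f"{hits / len(windows):.1%} of windows meet the >=50% criterion")
```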
Memorization, compression, and overfitting
- Debate over whether this is “memorization,” “recall,” or just good language modeling of very stereotypical prose.
- Some see it as evidence LLMs are extremely effective lossy compressors of text (see the compression sketch after this list); others note the reconstruction still requires substantial information supplied by the prompter.
- A few call this overfitting and a bug; others say mild memorization of popular texts is expected and even useful, as long as lawsuits are avoided.
- There’s discussion that model capacity is finite: no model can memorize its entire training set, so much heavier recall of Harry Potter than of other books implies disproportionately heavy exposure (multiple copies, fan fiction, quotes, etc.).
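
The compression framing above has a standard concrete form: a language model driven through an arithmetic coder is a lossless compressor, and the code length it achieves on a text is simply the model’s total negative log2‑probability of that text. A small illustrative sketch, using GPT‑2 as a stand‑in model (an assumption for portability; the thread concerns Llama 3.1):

```python
# Sketch: a language model's cross-entropy on a text equals the size in bits
# an arithmetic coder driven by that model would need to store it losslessly.
# GPT-2 is a stand-in model; the quoted sentence is the novel's famous opening.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

text = ("Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to "
        "say that they were perfectly normal, thank you very much.")
ids = tok(text, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits[0]
# Position i predicts token i+1; the first token is left uncoded in this sketch.
log_probs = torch.log_softmax(logits[:-1], dim=-1)
nats = -log_probs.gather(1, ids[0, 1:].unsqueeze(1)).sum().item()
model_bits = nats / math.log(2)

raw_bits = len(text.encode("utf-8")) * 8
print(f"arithmetic-coded size: {model_bits:.0f} bits vs raw: {raw_bits} bits")
# A memorized passage gets near-zero code length: the "compressor" already
# contains it, which is the lossy-compression point made above.
```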
Training data sources and alternative explanations
- Several comments point out that Harry Potter appears everywhere online: full pirated copies, quote sites, fanfiction archives, reviews, forums, wikis.
- Others counter that memorizing nearly half the book suggests more than just a few famous quotes; likely the full text (or many near‑full copies) were in the training set.
- Evidence is cited that Meta previously used pirated book datasets like Books3/LibGen for earlier LLaMA versions; whether later models are “clean” is contested.
Copyright, fair use, and legal theories
- Central thread: does training on copyrighted books, and then being able to reproduce chunks, infringe copyright?
- Three legal angles are cited: (1) copying during training, (2) the model itself as a derivative work, (3) infringing outputs.
- Some argue a 50‑token snippet is trivial and likely fair use; others point to the 42% coverage, draw analogies to compressed archives (.zip, .rar, autoencoders), and argue a lossy copy is still a copy.
- Human‑memory analogies (people memorizing books) are widely invoked; critics respond that law treats humans and machines differently and that scale and commercial use matter.
- Views range from “training on public web text should be allowed” to calls for strict licensing, to radical positions that copyright should be sharply shortened or abolished altogether.
Broader implications
- Concerns that LLMs weaken incentives for writers and other creators, versus counter‑claims that current copyright already mostly benefits large corporations.
- Some note that only open‑weight models allow this kind of audit (sketched below); closed models may have similar issues but are harder to probe.
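
The auditability point is easy to see in code: with open weights you can greedily decode from a book prefix, with no sampling randomness and no API terms of service in the way, and diff the result against the real continuation. A minimal sketch; the model name and 50‑token window are assumptions:

```python
# Sketch of the kind of audit only open weights permit: greedy-decode a
# continuation from a book prefix and score it against the real text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B"  # assumption: any locally runnable checkpoint
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16).eval()

def verbatim_match(prefix: str, true_continuation: str, n_tokens: int = 50) -> float:
    """Fraction of greedily decoded tokens identical to the ground truth."""
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    out = model.generate(prefix_ids, max_new_tokens=n_tokens,
                         do_sample=False, pad_token_id=tok.eos_token_id)
    generated = out[0, prefix_ids.shape[1]:]
    target = tok(true_continuation, return_tensors="pt").input_ids[0, :n_tokens]
    n = min(generated.shape[0], target.shape[0])
    return (generated[:n] == target[:n]).float().mean().item()

# Usage: slide this over a book you hold a copy of and plot the match rate.
# Against a closed API you could only sample outputs, not inspect weights
# or token probabilities.
```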