2026-04-30

Alignment whack-a-mole: Finetuning activates recall of copyrighted books in LLMs

Technical behavior and memorization

Thread centers on a paper/demo showing that with targeted finetuning, LLMs can be prompted to recall long, near-verbatim passages of copyrighted books.
Some note similar personal observations: models recognizing scanned book pages, auto-completing famous openings, or spitting out web text verbatim in niche contexts.
Others argue that some prompts “cheat” by encoding plot details so densely that the model is effectively guided into reconstructing the text, and question what this really proves about what the model stores internally.
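There is discussion of whether LLMs are truly injective/invertible (whether distinct inputs always map to distinct internal states from which the input could be recovered) and whether that matters for distinguishing “thinking” from simple memorization and interpolation.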
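
As a rough illustration of what “near-verbatim” could mean in practice, the sketch below compares a model output against a reference passage at the word level. It is not the paper's metric; the function names longest_shared_run and recall_ratio are hypothetical, and word-level matching with Python's standard difflib is just one simple assumption about how overlap might be measured.

```python
from difflib import SequenceMatcher


def longest_shared_run(reference: str, output: str) -> str:
    """Return the longest contiguous run of words found in both texts."""
    ref_words = reference.split()
    out_words = output.split()
    # autojunk=False turns off difflib's heuristic that skips very frequent tokens.
    matcher = SequenceMatcher(None, ref_words, out_words, autojunk=False)
    match = matcher.find_longest_match(0, len(ref_words), 0, len(out_words))
    return " ".join(ref_words[match.a : match.a + match.size])


def recall_ratio(reference: str, output: str) -> float:
    """Fraction of reference words covered by difflib's matching blocks."""
    ref_words = reference.split()
    out_words = output.split()
    if not ref_words:
        return 0.0
    matcher = SequenceMatcher(None, ref_words, out_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref_words)


if __name__ == "__main__":
    # Public-domain text (Dickens) stands in for a reference passage.
    reference = "it was the best of times it was the worst of times"
    output = "the model wrote it was the best of times and then stopped"
    print(longest_shared_run(reference, output))
    print(recall_ratio(reference, output))
```

A long shared run suggests verbatim recall, while a high ratio with only short runs is closer to paraphrase; which threshold counts as memorization is a judgment call that this sketch does not settle.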

Copyright, copyleft, and public domain

Large debate over whether this behavior is a fundamental threat to copyright or evidence that modern copyright (especially long durations) is already broken.
Some argue “intelligence is compression,” and that training on copyrighted works is inevitable and socially beneficial; others worry it dismantles the economic base for writing, journalism, and creative work.
Big subthread on how copyright enables copyleft/GPL and Creative Commons. Some say abolishing copyright would also kill these; others respond that copyleft is an exploit of copyright and could be replaced by rights-based regulation (e.g., mandated source access).
Many criticize the repeated extension of copyright terms far beyond their original short durations, saying landmark works should already be in the public domain.

Shadow libraries, access, and AI

One researcher admits systematically scanning and uploading books to shadow libraries and is enthusiastic that LLMs trained on them will answer obscure scholarly questions.
Critics question legality and ethics, especially when this indirectly helps commercial AI; defenders counter that shadow libraries are now ubiquitous in academia and often more usable than official holdings.
Debate over whether paywalled AI services will further defund public libraries or coexist with them; several point to local and open-weight models as a counterbalance.

Legal and societal outcomes

Some expect a “Napster moment” when users are sued for redistributing infringing LLM outputs; others think powerful tech and AI companies will simply reshape copyright law.
Comparisons to file sharing: enforcement reduced mass piracy but didn’t eliminate it; analogies drawn to future AI regulation and commoditization of models.
