Alignment whack-a-mole: Finetuning activates recall of copyrighted books in LLMs

Technical behavior and memorization

  • Thread centers on a paper/demo showing that with targeted finetuning, LLMs can be prompted to recall long, near-verbatim passages of copyrighted books.
  • Some note similar personal observations: models recognizing scanned book pages, auto-completing famous openings, or spitting out web text verbatim in niche contexts.
  • Others argue that some prompts “cheat” by packing in plot details so densely that the model is effectively guided into reconstructing the text, and question what this really proves about internal storage.
  • Some discuss whether LLMs are truly injective/invertible, and whether that matters for distinguishing “thinking” from simple memorization and interpolation.
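
The "near-verbatim" question raised above can be made concrete with a simple overlap measurement. The sketch below is illustrative only and is not the method used in the paper under discussion: it assumes you already have a reference passage and a model output as plain strings, and the function names `longest_shared_run` and `ngram_recall` are hypothetical. It reports the longest contiguous run of words shared by both texts and the fraction of the reference's word n-grams that reappear in the output.

```python
from difflib import SequenceMatcher


def longest_shared_run(reference: str, generated: str) -> int:
    """Length, in words, of the longest contiguous word sequence found in both texts."""
    ref_words = reference.lower().split()
    gen_words = generated.lower().split()
    # autojunk=False so frequent words are not silently ignored on long inputs.
    matcher = SequenceMatcher(None, ref_words, gen_words, autojunk=False)
    match = matcher.find_longest_match(0, len(ref_words), 0, len(gen_words))
    return match.size


def ngram_recall(reference: str, generated: str, n: int = 5) -> float:
    """Fraction of the reference's distinct word n-grams that also occur in the generated text."""
    ref_words = reference.lower().split()
    gen_words = generated.lower().split()
    ref_ngrams = {tuple(ref_words[i:i + n]) for i in range(len(ref_words) - n + 1)}
    gen_ngrams = {tuple(gen_words[i:i + n]) for i in range(len(gen_words) - n + 1)}
    if not ref_ngrams:
        # Reference shorter than n words: nothing to recall.
        return 0.0
    return len(ref_ngrams & gen_ngrams) / len(ref_ngrams)
```

A long shared run or a high n-gram recall suggests verbatim reproduction, but neither number settles the "cheating prompt" objection: a prompt that already contains much of the passage would also raise both scores, so the prompt itself would need to be checked against the reference in the same way.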

Copyright, copyleft, and public domain

  • Large debate over whether this behavior is a fundamental threat to copyright or evidence that modern copyright (especially long durations) is already broken.
  • Some argue “intelligence is compression,” and that training on copyrighted works is inevitable and socially beneficial; others worry it dismantles the economic base for writing, journalism, and creative work.
  • Big subthread on how copyright underpins copyleft (GPL) and Creative Commons licensing: some say abolishing copyright would kill these too; others respond that copyleft is an exploit of copyright and could be replaced by rights-based regulation (e.g., mandated source access).
  • Many criticize the repeated extension of copyright terms far beyond their originally short durations, saying landmark works should already be in the public domain.

Shadow libraries, access, and AI

  • One researcher admits systematically scanning and uploading books to shadow libraries and is enthusiastic that LLMs trained on them will answer obscure scholarly questions.
  • Critics question legality and ethics, especially when this indirectly helps commercial AI; defenders counter that shadow libraries are now ubiquitous in academia and often more usable than official holdings.
  • Debate over whether AI paywalls will further defund public libraries or coexist with them; several point to local and open-weight models as a counterbalance.

Legal and societal outcomes

  • Some expect a “Napster moment” when users are sued for redistributing infringing LLM outputs; others think powerful tech and AI companies will simply reshape copyright law.
  • Comparisons to file sharing: enforcement reduced mass piracy but didn’t eliminate it; commenters draw analogies to future AI regulation and the commoditization of models.