Anthropic cut up millions of used books, and downloaded 7M pirated ones – judge
Legal Status of Training on Copyrighted Books
- Many comments focus on the judge’s finding: scanning owned books and using them to train models was “exceedingly transformative” and fair use, while downloading pirated copies was not.
- Several point out this follows existing precedents (e.g., Google Books, search indexing): making internal digital copies and using them for a transformative tool can be fair use even when entire works are scanned.
- Others argue the law is unsettled: only early lower‑court rulings exist, some other cases are less friendly to AI, and a Supreme Court test on LLM training/output is widely expected.
Piracy vs. Fair Use and the Judge’s Ruling
- Key distinction drawn:
  - (1) Pirating books to build a digital library = clear infringement, possibly criminal at this scale.
  - (2) Scanning legitimately purchased books = fair use.
  - (3) Training on that internal corpus = fair use (in this ruling).
- Buying a book does not license redistribution; the model is treated as a new, transformative work unless it reproduces “meaningful chunks” verbatim.
- Some challenge the analogy to human learning and say courts are improperly anthropomorphizing LLMs.
Corporate Power, Double Standards, and Enforcement
- Strong resentment over asymmetry: individuals have been ruined or jailed for relatively small-scale infringement, whereas a heavily funded AI company may face only manageable civil penalties.
- Comparisons to past cases (software piracy sentences, Aaron Swartz, RIAA lawsuits) used to illustrate “one law for the rich, another for everyone else.”
- Others note that statutory damages (up to $150,000 per work for willful infringement) exist but are rarely awarded at full theoretical scale; settlements and transaction costs dominate in practice.
Impact on Authors and Future Creativity
- One camp: training on unlicensed books and selling AI access economically undercuts authors (especially mid‑ and low‑tier), discouraging future writing and teaching.
- Counter‑camp: many authors already earn little; people write mainly from intrinsic motivation; a single author’s marginal contribution to a trillion‑parameter model is negligible.
- Ongoing tension between “no copyright on knowledge” and protection of specific expressions and markets.
Analogies and Precedents from Other Tech Sectors
- Commenters cite Spotify, YouTube, Crunchyroll, cloud music lockers, and social platforms as past examples of “pirate first, legalize later” growth strategies; others dispute some of these histories as myths.
- Search engines are a frequent analogy: they copy and store entire pages and show snippets, which courts found transformative; supporters argue LLMs are similar, while critics say LLMs compete more directly with the original works.
Ethical and Philosophical Fault Lines
- Debate over whether copyright infringement is “stealing,” “theft of service,” or a distinct, lesser category; some argue infringement can be worse than tangible theft, others the opposite.
- Some hope AI pressure will force radical IP reform and expansion of the public domain; others fear AI giants will secure carve‑outs for themselves while IP remains strict for everyone else.
- Disagreement on whether it’s morally acceptable to train on anything one has lawfully obtained, vs. a need for explicit licensing and revenue-sharing.
Destruction of Books and Data Transparency
- Emotional reaction to millions of books being cut up; some see it as cultural loss, others as acceptable so long as the copies aren't unique and the content is digitally preserved.
- Several argue that the real systemic problem is non‑transparent datasets: without knowing exactly what went into training, claims about “fair use,” “zero‑shot,” and originality are impossible to evaluate.