It sure looks like Meta stole a lot of books to build its AI
Legality, Fair Use, and How Data Was Obtained
- Major thread on whether training on pirated books is more legally problematic than training on legitimately scanned books (e.g., library scans in prior cases).
- One side: training is “reading,” highly transformative, and thus fair use regardless of whether it is done by a machine or for profit; courts have previously allowed automated uses like thumbnails and snippet search.
- Other side: how the works were obtained (torrenting, known-infringing shadow libraries) should matter; courts may view willful acquisition of illegal copies as aggravating, possibly justifying punitive damages.
- Dispute over whether a model that regurgitates its training text crosses into infringement, as opposed to merely “learning” from it.
Human vs AI Learning Analogy
- Pro-AI camp argues AI should be allowed to learn from texts like humans do, without per-work licensing, and that “machines vs humans” shouldn’t change fair use analysis.
- Critics stress scale, automation, and corporate control: a for-profit system ingesting millions of works is not ethically or legally equivalent to an individual reading.
- Some note that humans can’t practically reproduce entire books verbatim but models sometimes can, which weakens the analogy.
Copyright, Morality, and Compensation
- Many participants see the current copyright regime as overgrown yet still necessary to support writers and artists.
- Some want this case to weaken copyright generally; others want it to curb “IP laundering” by large firms while preserving protection for individual creators.
- Proposals floated: book-purchase requirements, library-like access rules, streaming-style collective compensation; none detailed.
- Disagreement on damages: some think model developers should pay, at most, the retail price per book; others argue that would incentivize systematic “stealing,” so infringement must be penalized more harshly.
Assessment of Meta and Article Tone
- Several comments criticize Meta’s pattern of “copy first, ask later,” seeing this case as a continuation of past behavior.
- Others call the article emotionally charged or biased, especially early references to old scandals and political framing.
Scraping, Site Defenses, and Data Poisoning
- Discussion of anti-selection JavaScript (scripts that block text selection and copying) and paywalls as weak defenses against sophisticated scrapers.
- Ideas raised about adversarial content or “poisoned” pages to corrupt AI training data, compared humorously to SEO.
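To illustrate why commenters consider anti-selection JavaScript a weak defense, here is a minimal sketch using only Python's standard library. The page content, the `TextExtractor` class, and the `extract_text` helper are hypothetical, made up for this example rather than taken from the thread. The point is that selection-blocking handlers only run in a browser, while the text is still present in the HTML that any client receives.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        text = data.strip()
        if self._skip_depth == 0 and text:
            self.chunks.append(text)


def extract_text(html: str) -> str:
    """Return the page's text without ever executing its JavaScript."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page that tries to block selection and copying in the browser.
SAMPLE_PAGE = """
<html><body onselectstart="return false" oncopy="return false">
<script>document.addEventListener('selectstart', e => e.preventDefault());</script>
<p style="user-select: none">Chapter 1. It was a quiet morning.</p>
</body></html>
"""

print(extract_text(SAMPLE_PAGE))
```

Because the parser never runs the page's scripts or applies its CSS, the selection-blocking handlers have no effect on what is extracted. Pages that render their text only through JavaScript would need a different approach, which this sketch does not cover.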