Meta torrented & seeded an 81.7 TB dataset containing copyrighted data

Dataset scale and sourcing

  • Commenters estimate 81.7 TB corresponds to many millions of books; some link it to LibGen/Sci-Hub torrents via Anna’s Archive (~90M files, ~1 MB each).
  • Debate over file sizes: plain-text ebooks vs large scanned PDFs with illustrations, charts, and image-only pages.
  • Some note Meta’s own LLaMA paper already acknowledges using Books3 (derived from a private tracker dump) plus other copyrighted corpora.
  • Internal messages quoted in the article show staff worried about torrenting from corporate IPs and configuring clients to “seed as little as possible,” prompting jokes about Meta being “leechers.”
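The sizing estimates above can be sanity-checked with back-of-envelope arithmetic. This sketch only checks whether the figures quoted in the thread (81.7 TB total, ~90M files at ~1 MB each) are mutually consistent; none of the numbers are independently verified.

```python
# Back-of-envelope check: does 81.7 TB square with the "~90M files,
# ~1 MB each" estimate quoted for the Anna's Archive torrents?
# All figures are rough values from the thread, not verified data.

TB = 10**12  # terabyte in bytes (decimal convention)
MB = 10**6   # megabyte in bytes

dataset_bytes = 81.7 * TB
avg_file_bytes = 1 * MB

implied_files = dataset_bytes / avg_file_bytes
print(f"Implied file count at ~1 MB/file: {implied_files / 1e6:.1f} million")
# ~81.7 million files -- the same order of magnitude as the ~90M figure,
# consistent with mostly plain-text ebooks rather than large scanned PDFs.
```

At ~1 MB per file the implied count lands near 82 million, close enough to the ~90M figure to support the "mostly plain text" reading; large scanned PDFs averaging tens of MB would imply a far smaller file count for the same total.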

Legality: training vs distribution

  • The thread distinguishes two issues:
    • Training on copyrighted works (still a largely unresolved legal question; often argued as potential “fair use”).
    • Downloading and seeding pirated torrents (clearly infringing distribution regardless of AI context).
  • Several commenters calculate statutory damages (US minimum $750, up to $150k per work for willful infringement) and note that, at Meta’s scale, the theoretical totals reach into the trillions, far beyond any realistic judgment.
  • Some expect a modest fine or settlement; others predict courts may effectively legalize large-scale training on unlicensed data to avoid “crippling” US AI competitiveness.
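The damages arithmetic in the thread is simple multiplication. This sketch uses the US statutory figures quoted above; the work count is a hypothetical round number standing in for "many millions of books", not a court finding or a documented corpus size.

```python
# Illustrative statutory-damages arithmetic using the thread's US figures:
# $750 minimum and up to $150,000 per work for willful infringement.
# The work count is a hypothetical stand-in for "many millions of books".

MIN_PER_WORK = 750         # US statutory minimum per work (USD)
MAX_WILLFUL = 150_000      # US statutory maximum per work, willful (USD)

works = 7_000_000          # hypothetical count of infringed works

low = works * MIN_PER_WORK
high = works * MAX_WILLFUL
print(f"Statutory minimum:  ${low:,}")   # billions
print(f"Willful maximum:    ${high:,}")  # trillions
```

Even at the statutory minimum the total is in the billions, and the willful maximum pushes it past a trillion dollars, which is why commenters treat the theoretical exposure as disconnected from any realistic judgment or settlement.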

Two-tier justice and Aaron Swartz

  • Strong theme: contrast between Meta’s mass infringement and harsh treatment of individuals (Swartz, Megaupload, small-time torrenters).
  • Many see evidence of a “two-tier” legal system where corporations with lawyers and lobbyists get settlements, while individuals face ruinous penalties or prosecution.
  • Swartz’s case is revisited in detail; some stress prosecutorial overreach, others add nuance about plea deals, but most see the comparison as highlighting double standards.

Copyright, piracy, and precedent

  • Multiple historical analogies: YouTube’s early TV uploads, Google’s web indexing and book scanning, Spotify/Crunchyroll bootstrapping on pirated catalogs, Uber/Airbnb ignoring regulations.
  • Divided views:
    • One camp wants stricter, evenly enforced IP laws and corporate accountability.
    • Another argues current copyright (life+70, etc.) is “insane,” largely benefits big intermediaries, and should be radically shortened or abolished.
  • LibGen/Anna’s Archive are described by many as a “civilizational project”; preserving and democratizing access is seen as good, but monetizing that corpus via proprietary AI is more contentious.

Ethics of Meta and AI companies

  • Some focus on alleged deception: internal references to “stealth mode,” minimizing seeding, and potential misstatements in depositions (including by leadership).
  • Others argue Meta is relatively better than closed competitors because LLaMA weights are public, framing this as “software communism” versus OpenAI/Google’s proprietary models.
  • A recurring proposal: if a model is trained on unlicensed copyrighted data, its weights should be forced into the public domain or non-commercial-only use.

Broader structural critiques

  • Many tie this to a pattern where VC-backed firms “move fast and break laws,” then normalize their position via lobbying and settlements.
  • Several advocate either strong antitrust and IP enforcement against large firms, or—at the other extreme—using corporate overreach as leverage to dismantle or radically reform copyright itself.