Meta torrented & seeded an 81.7 TB dataset containing copyrighted data
Dataset scale and sourcing
- Commenters estimate 81.7 TB corresponds to many millions of books; some link it to the LibGen/Sci-Hub torrents mirrored by Anna’s Archive (~90M files averaging ~1 MB each).
- Debate over file sizes: plain-text ebooks vs large scanned PDFs with illustrations, charts, and image-only pages.
- Some note Meta’s own LLaMA paper already acknowledges using Books3 (derived from a private tracker dump) plus other copyrighted corpora.
- Internal messages quoted in the article show staff worried about torrenting from corporate IPs and configuring clients to “seed as little as possible,” prompting jokes about Meta being “leechers.”
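The ~90M-files-at-~1 MB estimate from the thread can be sanity-checked with back-of-envelope arithmetic. Both inputs below are commenter estimates, not official figures:

```python
# Back-of-envelope check of the thread's dataset-size estimate.
# Assumptions (from commenters, not confirmed): ~90M files, ~1 MB average.
num_files = 90_000_000
avg_file_bytes = 1_000_000           # ~1 MB per file (plain-text ebooks)

total_bytes = num_files * avg_file_bytes
total_tb = total_bytes / 1e12        # decimal terabytes
total_tib = total_bytes / 2**40      # binary tebibytes

print(f"{total_tb:.1f} TB ({total_tib:.1f} TiB)")
# prints "90.0 TB (81.9 TiB)"
```

90 TB decimal is ~81.9 TiB, close to the reported 81.7 — consistent with the headline figure if it is actually in TiB, which would also explain the debate over scanned PDFs versus plain text: a larger average file size would imply far fewer books.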
Legality: training vs distribution
- Thread distinguishes two issues:
  - Training on copyrighted works (still a largely unresolved legal question; often argued as potential “fair use”).
  - Downloading and seeding pirated torrents (clearly infringing distribution regardless of AI context).
- Several calculate statutory damages (US minimum of $750 per work, up to $150k per work for willful infringement) and note that, at Meta’s scale, the theoretical totals run into the trillions of dollars, far beyond any realistic judgment.
- Some expect a modest fine or settlement; others predict courts may effectively legalize large-scale training on unlicensed data to avoid “crippling” US AI competitiveness.
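The statutory-damages math cited in the thread can be sketched the same way. The per-work figures come from US law (17 U.S.C. § 504(c)); the ~90M work count is a commenter estimate and assumes every file is a separately registered work, which would not hold in practice:

```python
# Rough statutory-damages range at the scale commenters estimate.
# Assumption: ~90M distinct works (registration requirements and
# deduplication would reduce this number dramatically in reality).
works = 90_000_000
min_per_work = 750          # US statutory minimum per work
willful_cap = 150_000       # per-work cap for willful infringement

low = works * min_per_work      # $67.5 billion
high = works * willful_cap      # $13.5 trillion
print(f"${low / 1e9:.1f}B to ${high / 1e12:.1f}T")
# prints "$67.5B to $13.5T"
```

Even the statutory minimum exceeds any copyright judgment ever awarded, which is why commenters treat the trillions figure as rhetorical rather than a realistic exposure.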
Two-tier justice and Aaron Swartz
- Strong theme: contrast between Meta’s mass infringement and harsh treatment of individuals (Swartz, Megaupload, small-time torrenters).
- Many see evidence of a “two-tier” legal system where corporations with lawyers and lobbyists get settlements, while individuals face ruinous penalties or prosecution.
- Swartz’s case is revisited in detail; some stress prosecutorial overreach, others add nuance about plea deals, but most see the comparison as highlighting double standards.
Copyright, piracy, and precedent
- Multiple historical analogies: YouTube’s early TV uploads, Google’s web indexing and book scanning, Spotify/Crunchyroll bootstrapping on pirated catalogs, Uber/Airbnb ignoring regulations.
- Divided views:
  - One camp wants stricter, evenly enforced IP laws and corporate accountability.
  - Another argues current copyright terms (life + 70 years, etc.) are “insane,” largely benefit big intermediaries, and should be radically shortened or abolished.
- LibGen/Anna’s Archive are described by many as a “civilizational project”; preserving and democratizing access is seen as good, but monetizing that corpus via proprietary AI is more contentious.
Ethics of Meta and AI companies
- Some focus on alleged deception: internal references to “stealth mode,” minimizing seeding, and potential misstatements in depositions (including by leadership).
- Others argue Meta is relatively better than closed competitors because LLaMA weights are public, framing this as “software communism” versus OpenAI/Google’s proprietary models.
- A recurring proposal: if a model is trained on unlicensed copyrighted data, its weights should be forced into the public domain or non-commercial-only use.
Broader structural critiques
- Many tie this to a pattern where VC-backed firms “move fast and break laws,” then normalize their position via lobbying and settlements.
- Several advocate either strong antitrust and IP enforcement against large firms, or—at the other extreme—using corporate overreach as leverage to dismantle or radically reform copyright itself.