The Tragedy of Google Books (2017)
Digitizing Books & Google’s Role
- Many see it as tragic that a large scanned corpus (tens of millions of books) exists inside Google but is largely inaccessible due to legal constraints.
- Some argue Google was never “tasked” with this; it chose to take the project on, originally as part of a vision for a universal digital library and citation-based search.
- Others say such preservation shouldn’t depend on a private company at all, but on public institutions.
Copyright, Orphan Works, and Law Reform
- Commenters voice strong frustration with long copyright terms; proposals include drastically shorter terms (e.g., 15–50 years from publication), expensive renewals, and “use it or lose it” rules.
- Orphan works are seen as the central failure: no one can legally provide access, so digitized scans sit unused.
- There’s debate over whether an early settlement around orphan works would have created a productive “clearinghouse” and spurred legislation, or entrenched a monopoly.
- Some commenters reject copyright’s legitimacy outright and advocate unrestricted copying of cultural works.
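To make the arithmetic behind the shorter-term proposals concrete, here is a minimal sketch of a renewal-based term with a hard cap. Only the 15–50 year range comes from the discussion; the function name, the 5-year renewal step, and the treatment of renewals as a simple count are illustrative assumptions, not a scheme any commenter specified.

```python
def in_public_domain(pub_year, current_year, renewals_paid=0,
                     base_term=15, renewal_term=5, max_term=50):
    """Return True if a work has entered the public domain.

    Hypothetical scheme: a short base term, extended by each paid
    renewal, and capped at max_term years from publication.
    All default values are assumptions chosen for illustration.
    """
    term = min(base_term + renewals_paid * renewal_term, max_term)
    return current_year >= pub_year + term


# A work published in 2000 with no renewals is free after 15 years.
print(in_public_domain(2000, 2017))                   # True
# One paid renewal extends protection to 2020.
print(in_public_domain(2000, 2017, renewals_paid=1))  # False
```

Under a scheme like this, an unrenewed work (which would include most orphan works, since no one is around to pay) lapses after the base term, which is the effect the “use it or lose it” proposals aim for.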
Libraries, Archives, and Access
- HathiTrust is praised as a crucial alternative that exposes more public-domain texts and some in‑copyright works to affiliated researchers; others find it clunky, over‑restricted, and hard to download from.
- Internet Archive is both lauded as invaluable (Wayback Machine, broad access) and criticized for legal overreach (e.g., the pandemic‑era “e‑library”, the National Emergency Library), which some fear has poisoned the well for future reforms.
- Public and academic libraries are portrayed as underfunded, risk‑averse, and often forced into long embargoes or blanket restrictions due to copyright and privacy concerns.
Piracy, Shadow Libraries, and Practical Workarounds
- LibGen, Z‑Library, and Anna’s Archive are widely used to obtain ebooks, including many that are hard or impossible to buy.
- Some see piracy of out‑of‑print or very old works as ethically acceptable; others emphasize still buying physical books or ebooks to support authors.
LLMs and Use of the Corpus
- Multiple comments speculate that the Google Books corpus is or will be used to train large language models; some view this as inevitable, others as clearly unethical or illegal if it leads to verbatim regurgitation.
- There’s discussion of whether and how models can be trained not to reproduce training data; the consensus is that perfect guarantees are impossible, though post‑training can reduce memorization.
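One rough way to spot verbatim regurgitation is to measure how many word n-grams in a model's output also appear in a known source text. The sketch below is an illustrative assumption, not a method described in the thread: the function names and the whitespace tokenization are hypothetical, and a low score does not prove the absence of memorization.

```python
def ngrams(tokens, n):
    """Return the set of all contiguous n-token tuples in tokens."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def verbatim_overlap(output, source, n=8):
    """Fraction of the output's word n-grams that also occur in source.

    Uses naive lowercase whitespace tokenization (an assumption made
    for simplicity). Returns 0.0 if the output has fewer than n words.
    """
    out_grams = ngrams(output.lower().split(), n)
    if not out_grams:
        return 0.0
    src_grams = ngrams(source.lower().split(), n)
    return len(out_grams & src_grams) / len(out_grams)


source = "it was the best of times it was the worst of times"
print(verbatim_overlap(source, source, n=4))  # 1.0 (exact copy)
print(verbatim_overlap("an unrelated sentence with no shared phrasing at all",
                       source, n=4))          # 0.0
```

A check like this only flags exact copying against texts you already hold; it cannot catch paraphrase, which is one reason commenters doubt that perfect guarantees are possible.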
Technical and Preservation Challenges
- Digitization projects (e.g., national libraries) face issues beyond rights: fragile media, obsolete formats, specialized hardware, and the need to document workflows for future reprocessing.
- Defining “the work” is nontrivial: not just text/audio, but covers, labels, and physical artifacts.
Proposed Alternatives and Futures
- Suggestions include public, globally coordinated digital libraries; distributed peer‑to‑peer archives with tunable copyright “risk levels”; and legal frameworks that force ongoing availability or reversion to the public domain.
- Some are pessimistic that any major reform will happen soon; others see room for smaller technical and grassroots efforts (personal scanning, open‑sourcing old materials, better citation mapping).