2024-10-22

The Tragedy of Google Books (2017)

Digitizing Books & Google’s Role

Many see it as tragic that a large scanned corpus (tens of millions of books) exists inside Google but is largely inaccessible due to legal constraints.
Some argue Google was never “tasked” with this; it chose to do so, originally as part of a vision for a universal digital library and citation-based search.
Others say such preservation shouldn’t depend on a private company at all, but on public institutions.

Copyright, Orphan Works, and Law Reform

Commenters express strong frustration with long copyright terms; several propose drastically shorter ones (e.g., 15–50 years from publication, expensive renewals, or “use it or lose it” rules).
Orphan works (in-copyright works whose rights holders cannot be identified or located) are seen as the central failure: no one can legally provide access, so digitized scans sit unused.
There’s debate over whether an early settlement around orphan works would have created a productive “clearinghouse” and spurred legislation, or entrenched a monopoly.
Some commenters reject copyright’s legitimacy outright and advocate unrestricted copying of cultural works.

Libraries, Archives, and Access

HathiTrust is praised as a crucial alternative that exposes more public-domain texts and some in‑copyright works to affiliated researchers; others find it clunky, over‑restricted, and hard to download from.
Internet Archive is both lauded as invaluable (Wayback, broad access) and criticized for legal overreach (e.g., the pandemic “e‑library”), which some fear has poisoned the well for future reforms.
Public and academic libraries are portrayed as underfunded, risk‑averse, and often forced into long embargoes or blanket restrictions due to copyright and privacy concerns.

Piracy, Shadow Libraries, and Practical Workarounds

LibGen, Z‑Library, and Anna’s Archive are widely used to obtain ebooks, including many that are hard or impossible to buy.
Some see piracy of out‑of‑print or very old works as ethically acceptable; others emphasize still buying physical books or ebooks to support authors.

LLMs and Use of the Corpus

Multiple comments speculate that the Google Books corpus is or will be used to train large language models; some view this as inevitable, others as clearly unethical or illegal if it leads to verbatim regurgitation.
There’s discussion of whether and how models can be trained not to reproduce training data; the consensus is that perfect guarantees are impossible, though post‑training can reduce memorization.

Technical and Preservation Challenges

Digitization projects (e.g., national libraries) face issues beyond rights: fragile media, obsolete formats, specialized hardware, and the need to document workflows for future reprocessing.
Defining “the work” is nontrivial: not just text/audio, but covers, labels, and physical artifacts.

Proposed Alternatives and Futures

Suggestions include public, globally coordinated digital libraries; distributed peer‑to‑peer archives with tunable copyright “risk levels”; and legal frameworks that force ongoing availability or reversion to the public domain.
Some are pessimistic that any major reform will happen soon; others see room for smaller technical and grassroots efforts (personal scanning, open‑sourcing old materials, better citation mapping).
