Ask HN: Has anybody built search on top of Anna's Archive?
Scope and Feasibility of Full‑Text Search on Anna’s Archive
- Several commenters note that AA already has rich metadata search; the question is about full-text and possibly page-level search.
- Rough estimates: AA’s ~1 PB could become 10–20 TB of plaintext; indexing would further multiply storage but is still feasible on commodity hardware.
- Main technical bottlenecks:
  - Reliably extracting clean text from heterogeneous formats (especially scanned PDFs).
  - Handling OCR artifacts, hyphenation, footnotes, and layout quirks.
  - Choosing a search backend that can handle the scale (Lucene/Tantivy seen as more realistic than Meilisearch; SQLite+WASM for client-side experiments).
- Ideas include partial indexing (e.g., top 100k books first), static-hosted indexes fetched by hash, and TF‑IDF–style per-term shards.
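The SQLite route mentioned for client-side experiments can be tried in a few lines with SQLite's built-in FTS5 extension (table name and sample data are illustrative; a WASM deployment would ship the resulting database file, or pieces of it, to the browser):

```python
import sqlite3

# In-memory demo; a real index would live in a file that could be
# statically hosted and fetched by clients.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE books USING fts5(title, body)")
conn.executemany(
    "INSERT INTO books (title, body) VALUES (?, ?)",
    [
        ("Moby-Dick", "Call me Ishmael. Some years ago..."),
        ("Walden", "I went to the woods because I wished to live deliberately."),
    ],
)
# Full-text query, ranked by FTS5's built-in BM25-based `rank` column.
rows = conn.execute(
    "SELECT title FROM books WHERE books MATCH ? ORDER BY rank",
    ("woods",),
).fetchall()
print(rows)  # [('Walden',)]
```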
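The static-hosted, per-term shard idea could look like this minimal sketch, where a term's posting list lives in a small file addressed by the term's hash so a client fetches only the shards a query needs (shard layout, filenames, shard count, and sample documents are all hypothetical):

```python
import hashlib

def shard_path(term, n_shards=4096):
    """Map a term to a static shard file via its hash."""
    h = hashlib.sha256(term.encode()).hexdigest()
    return f"shards/{int(h, 16) % n_shards:04x}.json"

# Build tiny posting lists (term -> list of doc ids), grouped by shard file.
docs = {1: "the whale hunts", 2: "the woods at walden", 3: "whale songs"}
shards = {}
for doc_id, text in docs.items():
    for term in set(text.split()):
        shards.setdefault(shard_path(term), {}).setdefault(term, []).append(doc_id)

# A client resolving a query computes the shard name locally, fetches that
# one file, and reads the posting list out of it.
print(sorted(shards[shard_path("whale")]["whale"]))  # [1, 3]
```

TF-IDF or BM25 weights could be stored alongside each doc id in the same shard, which is roughly what the thread's "TF-IDF-style per-term shards" suggestion amounts to.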
Deduplication, Editions, and Result Quality
- Simple ISBN-based dedup is inadequate: many editions per ISBN family, multiple ISBNs per work, retitled collections, etc.
- Alternatives suggested: Library of Congress or Dewey classifications plus author/title/edition; or content-based dedup.
- Users want one canonical result per work, with optional edition drill‑down and weighting by quality; also the possibility of indexing a “plain” version but serving a nicer EPUB/PDF.
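The content-based dedup option can be approximated with word-shingle Jaccard similarity, the building block behind MinHash-style near-duplicate detection (the threshold, shingle size, and sample texts are illustrative):

```python
import re

def shingles(text, k=5):
    """Set of k-word shingles from lowercased, punctuation-stripped text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Two near-identical "editions" of the same work vs. an unrelated text.
ed1 = "It was the best of times, it was the worst of times, it was the age of wisdom."
ed2 = "It was the best of times; it was the worst of times; it was the age of foolishness."
other = "Call me Ishmael. Some years ago, never mind how long precisely."

print(jaccard(shingles(ed1), shingles(ed2)))    # high: same work
print(jaccard(shingles(ed1), shingles(other)))  # 0.0: different works
```

At archive scale one would hash the shingles (MinHash/LSH) rather than compare full sets pairwise, but the grouping logic is the same: cluster above a similarity threshold, then pick the highest-quality file per cluster as the canonical result.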
Use Cases and Value
- Proposed beneficiaries:
  - Researchers in fields heavily dependent on older books and paywalled PDFs.
  - People wanting direct access to exact passages instead of LLM paraphrases.
  - Niche projects like curated “canons” (e.g., frequently cited HN books) optimized for semantic/LLM search.
- Some see it as “game‑changing” for scholarship and knowledge access; others question who would use it versus Google Books, Amazon, or Goodreads and how it would be funded.
Legal and Policy Risks
- Core concern: indexing and exposing full text of largely pirated material may be treated like facilitating infringement (Pirate Bay analogy), even without hosting files.
- Distinction drawn between:
  - LLM training as a possibly transformative, in‑house use.
  - A public engine that enables verbatim retrieval and points users to shadow libraries.
- Some argue that the fair-use precedent around Google Books could apply; others note that AA’s sources are outright unauthorized, which makes the situation riskier.
- Several commenters conclude the project is non‑monetizable, high‑risk legally, and thus unlikely to be publicly deployed, though individuals could build private indexes.
Existing Partial Solutions
- Z‑Library reportedly offers limited full-text search, but at a much smaller scale.
- Various book-search tools exist, along with a competition run by AA, but these mostly cover metadata/ISBN lookup rather than the full text of all books.
- Android apps and external search tricks (e.g., site:annas-archive.org queries) provide practical but shallow search.
LLMs and Double Standards
- Widely shared belief that major LLMs (Meta, others) have already ingested AA/related datasets; Meta’s torrenting of AA is cited.
- Several comments highlight perceived double standards: individuals and small sites face harsh copyright enforcement, while large corporations push legal boundaries with relative impunity.
Illegal Non‑Copyright Content
- Some worry that bulk-downloading AA might incidentally pull in non‑copyright criminal content (e.g., sexual exploitation material or bomb manuals).
- Opinions differ on how much such material is actually present and on how different jurisdictions treat text versus images, or purely instructional content.
- This contributes to hesitancy about mirroring or seeding large chunks of the archive.