Ask HN: Has anybody built search on top of Anna's Archive?

Scope and Feasibility of Full‑Text Search on Anna’s Archive

  • Several commenters note that AA already has rich metadata search; the question is about full-text and possibly page-level search.
  • Rough estimates: AA’s ~1 PB of files could reduce to 10–20 TB of extracted plaintext; an inverted index would multiply that storage again, but the total is still feasible on commodity hardware.
  • Main technical bottlenecks:
    • Reliably extracting clean text from heterogeneous formats (especially scanned PDFs).
    • Handling OCR artifacts, hyphenation, footnotes, layout quirks.
    • Choosing a search backend that can handle the scale (Lucene/Tantivy seen as more realistic than Meilisearch; SQLite+WASM for client-side experiments).
  • Ideas include partial indexing (e.g., top 100k books first), static-hosted indexes fetched by hash, and TF‑IDF–style per-term shards.
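The static-hosted, hash-addressed shard idea can be sketched as follows. Everything here is illustrative (shard count, file naming, tokenization are hypothetical), but it shows the pattern that makes dumb static hosting workable: a client hashes the query term itself, so it knows which single shard file to fetch with no server-side lookup.

```python
import hashlib
from collections import defaultdict

NUM_SHARDS = 16  # illustrative; a real deployment might use thousands


def shard_for(term: str) -> int:
    """Map a term to a shard via a stable hash, so any client can compute
    which static file (e.g. shard-03.json) to fetch."""
    digest = hashlib.sha256(term.encode("utf-8")).digest()
    return digest[0] % NUM_SHARDS


def build_shards(docs: dict[str, str]) -> dict[int, dict[str, list[str]]]:
    """Build per-term posting lists grouped into hash-addressed shards.
    Each shard would be serialized and served from object storage or a CDN."""
    shards: dict[int, dict[str, list[str]]] = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):  # naive tokenizer, for sketch only
            shards[shard_for(term)].setdefault(term, []).append(doc_id)
    return shards


def lookup(shards, term: str) -> list[str]:
    """A client fetches only the one shard its term's hash points at."""
    term = term.lower()
    return shards.get(shard_for(term), {}).get(term, [])


docs = {
    "book1": "deep learning with python",
    "book2": "python for data analysis",
    "book3": "the art of computer programming",
}
shards = build_shards(docs)
print(sorted(lookup(shards, "python")))  # book1 and book2
```

A partial index (e.g. the top 100k books first) slots into this naturally: shards just contain fewer posting lists, and coverage can grow without changing the client-side fetch logic.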

Deduplication, Editions, and Result Quality

  • Simple ISBN-based dedup is inadequate: many editions per ISBN family, multiple ISBNs per work, retitled collections, etc.
  • Alternatives suggested: Library of Congress or Dewey classifications plus author/title/edition; or content-based dedup.
  • Users want one canonical result per work, with optional edition drill‑down and weighting by quality; also the possibility of indexing a “plain” version but serving a nicer EPUB/PDF.
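Content-based dedup along the lines suggested can be sketched with word shingles and Jaccard similarity. The shingle size and threshold below are illustrative, and at AA scale one would use MinHash/LSH rather than pairwise comparison, but the underlying similarity notion is the same: two editions of one work share most of their text, so their shingle sets overlap heavily even when front matter, pagination, or titles differ.

```python
from typing import Set


def shingles(text: str, k: int = 5) -> Set[str]:
    """k-word shingles of normalized text; edition-level differences
    change only a small fraction of these."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Set overlap in [0, 1]."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def same_work(text_a: str, text_b: str, threshold: float = 0.8) -> bool:
    """Treat two texts as the same underlying work if their shingle sets
    overlap above a threshold (0.8 is an illustrative cutoff)."""
    return jaccard(shingles(text_a), shingles(text_b)) >= threshold


a = "it was the best of times it was the worst of times"
print(same_work(a, a + " indeed"))  # near-identical editions match
```

Grouping by this relation yields one canonical cluster per work; picking the best-quality edition within each cluster (and serving a nicer EPUB/PDF than the one indexed) then becomes a ranking problem inside the cluster rather than a matching problem across the whole archive.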

Use Cases and Value

  • Proposed beneficiaries:
    • Researchers in fields heavily dependent on older books and paywalled PDFs.
    • People wanting direct access to exact passages instead of LLM paraphrases.
    • Niche projects like curated “canons” (e.g., frequently cited HN books) optimized for semantic/LLM search.
  • Some see it as “game‑changing” for scholarship and knowledge access; others question who would use it versus Google Books, Amazon, or Goodreads and how it would be funded.

Legal and Policy Risks

  • Core concern: indexing and exposing the full text of largely pirated material may be treated as facilitating infringement (the Pirate Bay analogy), even without hosting any files.
  • Distinction drawn between:
    • LLM training as a possibly transformative, in‑house use.
    • A public engine that enables verbatim retrieval and points users to shadow libraries.
  • Some argue the Google Books fair-use precedent could apply; others note that AA’s sources are outright unauthorized, which makes the situation riskier.
  • Several commenters conclude the project is non‑monetizable, high‑risk legally, and thus unlikely to be publicly deployed, though individuals could build private indexes.

Existing Partial Solutions

  • Z‑Library reportedly offers limited full-text search but at smaller scale.
  • Various book-search tools and an AA-run competition exist, but they mostly cover metadata/ISBN rather than the full text of all books.
  • Android apps and external search tricks (e.g., site:annas-archive.org) provide practical but shallow search.

LLMs and Double Standards

  • Widely shared belief that major LLMs (Meta, others) have already ingested AA/related datasets; Meta’s torrenting of AA is cited.
  • Several comments highlight perceived double standards: individuals and small sites face harsh copyright enforcement, while large corporations push legal boundaries with relative impunity.

Illegal Non‑Copyright Content

  • Some worry that bulk-downloading AA might incidentally pull in non‑copyright criminal content (e.g., sexual exploitation material or bomb manuals).
  • Opinions differ on how much such material is present and how laws in different jurisdictions treat text vs. images, or instructional content.
  • This contributes to hesitancy about mirroring or seeding large chunks of the archive.