Ask HN: Has anybody built search on top of Anna's Archive?
Scope and Feasibility of Full‑Text Search on Anna’s Archive
- Several commenters note that AA already has rich metadata search; the question is about full-text and possibly page-level search.
- Rough estimates: AA’s ~1 PB could become 10–20 TB of plaintext; indexing would further multiply storage but is still feasible on commodity hardware.
- Main technical bottlenecks:
  - Reliably extracting clean text from heterogeneous formats (especially scanned PDFs).
  - Handling OCR artifacts, hyphenation, footnotes, and layout quirks.
  - Choosing a search backend that can handle the scale (Lucene/Tantivy seen as more realistic than Meilisearch; SQLite+WASM for client-side experiments).
- Ideas include partial indexing (e.g., top 100k books first), static-hosted indexes fetched by hash, and TF‑IDF–style per-term shards.
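The SQLite route mentioned for client-side experiments can be tried in a few lines with SQLite's built-in FTS5 extension (table name and sample data are illustrative; a WASM deployment would ship the resulting database file, or pieces of it, to the browser):

```python
import sqlite3

# In-memory demo; a real index would live in a file that could be
# statically hosted and fetched by clients.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE books USING fts5(title, body)")
conn.executemany(
    "INSERT INTO books (title, body) VALUES (?, ?)",
    [
        ("Moby-Dick", "Call me Ishmael. Some years ago..."),
        ("Walden", "I went to the woods because I wished to live deliberately."),
    ],
)
# Full-text query, ranked by FTS5's built-in BM25-based `rank` column.
rows = conn.execute(
    "SELECT title FROM books WHERE books MATCH ? ORDER BY rank",
    ("woods",),
).fetchall()
print(rows)  # [('Walden',)]
```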
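The static-hosted, per-term shard idea could look like this minimal sketch, where a term's posting list lives in a small file addressed by the term's hash so a client fetches only the shards a query needs (shard layout, filenames, shard count, and sample documents are all hypothetical):

```python
import hashlib

def shard_path(term, n_shards=4096):
    """Map a term to a static shard file via its hash."""
    h = hashlib.sha256(term.encode()).hexdigest()
    return f"shards/{int(h, 16) % n_shards:04x}.json"

# Build tiny posting lists (term -> list of doc ids), grouped by shard file.
docs = {1: "the whale hunts", 2: "the woods at walden", 3: "whale songs"}
shards = {}
for doc_id, text in docs.items():
    for term in set(text.split()):
        shards.setdefault(shard_path(term), {}).setdefault(term, []).append(doc_id)

# A client resolving a query computes the shard name locally, fetches that
# one file, and reads the posting list out of it.
print(sorted(shards[shard_path("whale")]["whale"]))  # [1, 3]
```

TF-IDF or BM25 weights could be stored alongside each doc id in the same shard, which is roughly what the thread's "TF-IDF-style per-term shards" suggestion amounts to.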
Deduplication, Editions, and Result Quality
- Simple ISBN-based dedup is inadequate: many editions per ISBN family, multiple ISBNs per work, retitled collections, etc.
- Alternatives suggested: Library of Congress or Dewey classifications plus author/title/edition; or content-based dedup.
- Users want one canonical result per work, with optional edition drill‑down and weighting by quality; also the possibility of indexing a “plain” version but serving a nicer EPUB/PDF.
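The content-based dedup option can be approximated with word-shingle Jaccard similarity, the building block behind MinHash-style near-duplicate detection (the threshold, shingle size, and sample texts are illustrative):

```python
import re

def shingles(text, k=5):
    """Set of k-word shingles from lowercased, punctuation-stripped text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Two near-identical "editions" of the same work vs. an unrelated text.
ed1 = "It was the best of times, it was the worst of times, it was the age of wisdom."
ed2 = "It was the best of times; it was the worst of times; it was the age of foolishness."
other = "Call me Ishmael. Some years ago, never mind how long precisely."

print(jaccard(shingles(ed1), shingles(ed2)))    # high: same work
print(jaccard(shingles(ed1), shingles(other)))  # 0.0: different works
```

At archive scale one would hash the shingles (MinHash/LSH) rather than compare full sets pairwise, but the grouping logic is the same: cluster above a similarity threshold, then pick the highest-quality file per cluster as the canonical result.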
Use Cases and Value
- Proposed beneficiaries:
  - Researchers in fields heavily dependent on older books and paywalled PDFs.
  - People wanting direct access to exact passages instead of LLM paraphrases.
  - Niche projects like curated “canons” (e.g., frequently cited HN books) optimized for semantic/LLM search.
- Some see it as “game‑changing” for scholarship and knowledge access; others question who would use it versus Google Books, Amazon, or Goodreads and how it would be funded.
Legal and Policy Risks
- Core concern: indexing and exposing full text of largely pirated material may be treated like facilitating infringement (Pirate Bay analogy), even without hosting files.
- Distinction drawn between:
  - LLM training as a possibly transformative, in‑house use.
  - A public engine that enables verbatim retrieval and points users to shadow libraries.
- Some argue that the fair-use precedent around Google Books could apply; others note that AA’s sources are outright unauthorized, which makes the situation riskier.
- Several commenters conclude the project is non‑monetizable, high‑risk legally, and thus unlikely to be publicly deployed, though individuals could build private indexes.
Existing Partial Solutions
- Z‑Library reportedly offers limited full-text search, but at a much smaller scale.
- Various book-search tools exist, along with a competition run by AA, but these mostly cover metadata/ISBN lookup rather than the full text of all books.
- Android apps and external search tricks (e.g., site:annas-archive.org queries) provide practical but shallow search.
LLMs and Double Standards
- Widely shared belief that major LLMs (Meta, others) have already ingested AA/related datasets; Meta’s torrenting of AA is cited.
- Several comments highlight perceived double standards: individuals and small sites face harsh copyright enforcement, while large corporations push legal boundaries with relative impunity.
Illegal Non‑Copyright Content
- Some worry that bulk-downloading AA might incidentally pull in non‑copyright criminal content (e.g., sexual exploitation material or bomb manuals).
- Opinions differ on how much such material is actually present and on how different jurisdictions treat text versus images, or purely instructional content.
- This contributes to hesitancy about mirroring or seeding large chunks of the archive.