Show HN: PageIndex – Vectorless RAG

Approach & Intuition

  • PageIndex builds a hierarchical tree / table-of-contents over documents, with LLM-generated summaries at each node, then uses an LLM to traverse this tree at query time instead of doing vector similarity search (a rough sketch follows this list).
  • Several commenters liken it to B+ trees, source graphs, or Monte Carlo tree search / AlphaGo-style exploration.
  • The pitch is “human-like” navigation: simulate how an expert would skim structure, then dive into relevant sections.
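
A minimal sketch of what that traversal could look like, assuming a hypothetical `llm()` completion function and an invented node layout (title, summary, page range, children); this illustrates the idea and is not PageIndex's actual schema or API:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One table-of-contents entry: title, LLM-written summary,
    the page range it covers, and its child sections."""
    title: str
    summary: str
    pages: tuple[int, int]
    children: list["Node"] = field(default_factory=list)

def traverse(node: Node, query: str, llm) -> list[Node]:
    """Descend the tree by asking the LLM which child sections look
    relevant, instead of comparing embedding vectors."""
    if not node.children:
        return [node]  # leaf: these pages get read in full
    menu = "\n".join(
        f"{i}: {c.title} - {c.summary}" for i, c in enumerate(node.children)
    )
    answer = llm(
        f"Question: {query}\n"
        f"Sections:\n{menu}\n"
        "Reply with the numbers of the sections worth opening, comma-separated."
    )
    picked = [int(s) for s in answer.split(",") if s.strip().isdigit()]
    hits: list[Node] = []
    for i in picked:
        if 0 <= i < len(node.children):
            hits.extend(traverse(node.children[i], query, llm))
    return hits
```

Keeping several branches open per level improves recall but multiplies LLM calls, which is the cost trade-off discussed in the next section.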

Latency, Cost & Scalability

  • Many are concerned this is slower and more expensive than embeddings, since both indexing and retrieval involve LLM calls.
  • The author clarifies: tree-building is a one-time cost and can be slow; retrieval at query time uses only the tree (no embedding model) and can be efficient for small trees.
  • Skeptics argue it will “scale spectacularly poorly” beyond a few hundred documents; proponents counter that hierarchical search is logarithmic and should scale (a back-of-the-envelope sketch follows), but admit real benchmarks are needed.
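
As a back-of-the-envelope illustration of the “logarithmic” argument (the branching factor and beam width here are assumptions, not numbers from the project):

```python
import math

def llm_calls_per_query(num_leaves: int, branching: int = 10, beam: int = 1) -> int:
    """Rough estimate: one LLM call per tree level per branch kept open.
    Depth grows as log_b(N), so a million leaf sections at branching 10
    means roughly 6 levels for a single-path descent."""
    depth = math.ceil(math.log(num_leaves, branching))
    return depth * beam

print(llm_calls_per_query(1_000_000))            # ~6 for a single-path descent
print(llm_calls_per_query(1_000_000, beam=3))    # ~18 if 3 branches stay open per level
```

Each of those calls is a full LLM round-trip, which is the skeptics' point: the constant cost per step is orders of magnitude larger than a vector lookup, even if the step count grows slowly.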

Comparison with Vector RAG & GraphRAG

  • Some say vector DBs are like hash maps, PageIndex like trees: fast approximate lookup vs. structured traversal.
  • Multiple comments note that embeddings often return “semantic vibes” rather than truly relevant passages, especially in dense, domain-specific corpora.
  • GraphRAG is described as powerful but with extremely expensive, non-linear preprocessing; PageIndex instead shifts cost to per-query dynamic exploration.
  • Others argue vectors will remain useful as a first-pass filter, with tree/agentic retrieval doing deeper reasoning.

Accuracy, “Vibe” Retrieval & Reasoning

  • There’s disagreement over the claim that this is “less vibe-y” than vectors: critics point out it still depends heavily on LLM judgment and LLM-generated structure, just in a different form.
  • Supporters emphasize scenarios where conceptual similarity fails: cross-document inconsistencies, time-scoped questions, or locating exact quotations.
  • Several see value in being able to spend more compute for predictably better answers, especially in high-stakes or offline workflows.

Hybrid & Alternative Strategies

  • Ideas raised:
    • Use vectors on node summaries to guide tree search (sketched after this list).
    • Invert RAG: generate likely questions or “tiny overview” summaries at ingest time, then use BM25/keyword search plus LLM re-ranking.
    • Let LLMs generate SQL/regex queries instead of vectors.
    • Combine vector ANN for a wide shortlist with LLM or cross-encoder re-ranking.
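
One way the first and last ideas could be combined, sketched with a hypothetical `embed()` sentence-embedding function and the invented `Node` layout from the earlier sketch (none of this is an existing PageIndex feature):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def guided_traverse(node, query_vec, embed, llm, keep: int = 3):
    """Cheap vector pass over child summaries to shortlist branches,
    then an LLM picks which shortlisted branch to actually open."""
    if not node.children:
        return node
    shortlist = sorted(
        node.children,
        key=lambda c: cosine(query_vec, embed(c.summary)),
        reverse=True,
    )[:keep]
    menu = "\n".join(f"{i}: {c.title} - {c.summary}" for i, c in enumerate(shortlist))
    answer = llm(
        f"Candidate sections:\n{menu}\nReply with the number of the one to open."
    )
    idx = int(answer) if answer.strip().isdigit() else 0
    choice = shortlist[idx] if idx < len(shortlist) else shortlist[0]
    return guided_traverse(choice, query_vec, embed, llm, keep)
```

The vector pass narrows wide nodes cheaply while the LLM still makes the final structural decision, roughly the division of labor the “first-pass filter” comments describe.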

Use Cases & Limits

  • Widely viewed as promising for: single documents or small corpora, complex long reports, offline/background processing, and high-accuracy domains (legal, medical, diagnostics).
  • Seen as a poor fit for: massive multi-million-document corpora, and real-time chat-style RAG where low latency and cheap inference dominate.

Ingestion, OCR & Structure Extraction

  • A side thread notes that PageIndex depends heavily on high-quality structure (headings, sections), making PDF/HTML-to-markdown conversion and OCR quality critical (see the sketch after this list).
  • Various tools and pipelines are mentioned; some highlight the need for “document layout analysis” rather than simple OCR.
  • PageIndex-associated OCR and HTML support are mentioned, but commenters request formal benchmarks.
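
For documents that already convert cleanly to markdown, the skeleton such a tree needs can be approximated from heading levels alone; a rough sketch of that step, ignoring code fences and other edge cases (real pipelines would still need layout analysis for scanned PDFs, as commenters note):

```python
def toc_from_markdown(md: str) -> dict:
    """Build a nested table-of-contents from '#'-style headings.
    Each node keeps its title, the raw text under it, and its child
    sections; node summaries would be filled in by an LLM afterwards."""
    root = {"title": "root", "text": "", "children": []}
    stack = [(0, root)]                       # (heading level, node)
    for line in md.splitlines():
        if line.startswith("#"):
            level = len(line) - len(line.lstrip("#"))
            node = {"title": line.lstrip("# ").strip(), "text": "", "children": []}
            while stack[-1][0] >= level:      # pop back up to the parent level
                stack.pop()
            stack[-1][1]["children"].append(node)
            stack.append((level, node))
        else:
            stack[-1][1]["text"] += line + "\n"
    return root
```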

Benchmarks & Skepticism

  • Several commenters are uneasy about the lack of broad, standard RAG benchmark results and worry the showcased FinanceBench gains may rely on weak baselines.
  • Others note that LLM-based indexing and retrieval introduce more tuning knobs (prompts, structure, chunking) and may increase iteration cost vs. embeddings.
  • Overall sentiment: conceptually interesting and likely useful in niches, but needs rigorous large-scale, apples-to-apples evaluations against well-tuned hybrid vector systems.