Show HN: PageIndex – Vectorless RAG

Approach & Intuition

  • PageIndex builds a hierarchical tree / table-of-contents over documents, with LLM-generated summaries at each node, then uses an LLM to traverse this tree at query time instead of doing vector similarity search (a rough sketch follows this list).
  • Several commenters liken it to B+ trees, source graphs, or Monte Carlo tree search / AlphaGo-style exploration.
  • The pitch is “human-like” navigation: simulate how an expert would skim structure, then dive into relevant sections.
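
A minimal sketch of what that traversal could look like, assuming a hypothetical `llm()` completion function and an invented node layout (title, summary, page range, children); this illustrates the idea and is not PageIndex's actual schema or API:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One table-of-contents entry: title, LLM-written summary,
    the page range it covers, and its child sections."""
    title: str
    summary: str
    pages: tuple[int, int]
    children: list["Node"] = field(default_factory=list)

def traverse(node: Node, query: str, llm) -> list[Node]:
    """Descend the tree by asking the LLM which child sections look
    relevant, instead of comparing embedding vectors."""
    if not node.children:
        return [node]  # leaf: these pages get read in full
    menu = "\n".join(
        f"{i}: {c.title} - {c.summary}" for i, c in enumerate(node.children)
    )
    answer = llm(
        f"Question: {query}\n"
        f"Sections:\n{menu}\n"
        "Reply with the numbers of the sections worth opening, comma-separated."
    )
    picked = [int(s) for s in answer.split(",") if s.strip().isdigit()]
    hits: list[Node] = []
    for i in picked:
        if 0 <= i < len(node.children):
            hits.extend(traverse(node.children[i], query, llm))
    return hits
```

Keeping several branches open per level improves recall but multiplies LLM calls, which is the cost trade-off discussed in the next section.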

Latency, Cost & Scalability

  • Many are concerned this is slower and more expensive than embeddings, since both indexing and retrieval involve LLM calls.
  • The author clarifies: tree-building is a one-time cost and can be slow; retrieval at query time uses only the tree (no embedding model) and can be efficient for small trees.
  • Skeptics argue it will “scale spectacularly poorly” beyond a few hundred documents; proponents counter that hierarchical search is logarithmic and should scale (a back-of-the-envelope sketch follows), but admit real benchmarks are needed.
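
As a back-of-the-envelope illustration of the “logarithmic” argument (the branching factor and beam width here are assumptions, not numbers from the project):

```python
import math

def llm_calls_per_query(num_leaves: int, branching: int = 10, beam: int = 1) -> int:
    """Rough estimate: one LLM call per tree level per branch kept open.
    Depth grows as log_b(N), so a million leaf sections at branching 10
    means roughly 6 levels for a single-path descent."""
    depth = math.ceil(math.log(num_leaves, branching))
    return depth * beam

print(llm_calls_per_query(1_000_000))            # ~6 for a single-path descent
print(llm_calls_per_query(1_000_000, beam=3))    # ~18 if 3 branches stay open per level
```

Each of those calls is a full LLM round-trip, which is the skeptics' point: the constant cost per step is orders of magnitude larger than a vector lookup, even if the step count grows slowly.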

Comparison with Vector RAG & GraphRAG

  • Some say vector DBs are like hash maps, PageIndex like trees: fast approximate lookup vs. structured traversal.
  • Multiple comments note that embeddings often return “semantic vibes” rather than truly relevant passages, especially in dense, domain-specific corpora.
  • GraphRAG is described as powerful but with extremely expensive, non-linear preprocessing; PageIndex instead shifts cost to per-query dynamic exploration.
  • Others argue vectors will remain useful as a first-pass filter, with tree/agentic retrieval doing deeper reasoning.

Accuracy, “Vibe” Retrieval & Reasoning

  • There’s disagreement over the claim that this is “less vibe-y” than vectors: critics point out it still depends heavily on LLM judgment and LLM-generated structure, just in a different form.
  • Supporters emphasize scenarios where conceptual similarity fails: cross-document inconsistencies, time-scoped questions, or locating exact quotations.
  • Several see value in being able to spend more compute for predictably better answers, especially in high-stakes or offline workflows.

Hybrid & Alternative Strategies

  • Ideas raised:
    • Use vectors on node summaries to guide tree search (sketched after this list).
    • Invert RAG: generate likely questions or “tiny overview” summaries at ingest time, then use BM25/keyword search plus LLM re-ranking.
    • Let LLMs generate SQL/regex queries instead of vectors.
    • Combine vector ANN for a wide shortlist with LLM or cross-encoder re-ranking.
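
One way the first and last ideas could be combined, sketched with a hypothetical `embed()` sentence-embedding function and the invented `Node` layout from the earlier sketch (none of this is an existing PageIndex feature):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def guided_traverse(node, query_vec, embed, llm, keep: int = 3):
    """Cheap vector pass over child summaries to shortlist branches,
    then an LLM picks which shortlisted branch to actually open."""
    if not node.children:
        return node
    shortlist = sorted(
        node.children,
        key=lambda c: cosine(query_vec, embed(c.summary)),
        reverse=True,
    )[:keep]
    menu = "\n".join(f"{i}: {c.title} - {c.summary}" for i, c in enumerate(shortlist))
    answer = llm(
        f"Candidate sections:\n{menu}\nReply with the number of the one to open."
    )
    idx = int(answer) if answer.strip().isdigit() else 0
    choice = shortlist[idx] if idx < len(shortlist) else shortlist[0]
    return guided_traverse(choice, query_vec, embed, llm, keep)
```

The vector pass narrows wide nodes cheaply while the LLM still makes the final structural decision, roughly the division of labor the “first-pass filter” comments describe.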

Use Cases & Limits

  • Widely viewed as promising for: single documents or small corpora, complex long reports, offline/background processing, and high-accuracy domains (legal, medical, diagnostics).
  • Seen as a poor fit for: massive multi-million-document corpora, and real-time chat-style RAG where low latency and cheap inference dominate.

Ingestion, OCR & Structure Extraction

  • A side thread notes that PageIndex depends heavily on high-quality structure (headings, sections), making PDF/HTML-to-markdown conversion and OCR quality critical (see the sketch after this list).
  • Various tools and pipelines are mentioned; some highlight the need for “document layout analysis” rather than simple OCR.
  • PageIndex-associated OCR and HTML support are mentioned, but commenters request formal benchmarks.
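
For documents that already convert cleanly to markdown, the skeleton such a tree needs can be approximated from heading levels alone; a rough sketch of that step, ignoring code fences and other edge cases (real pipelines would still need layout analysis for scanned PDFs, as commenters note):

```python
def toc_from_markdown(md: str) -> dict:
    """Build a nested table-of-contents from '#'-style headings.
    Each node keeps its title, the raw text under it, and its child
    sections; node summaries would be filled in by an LLM afterwards."""
    root = {"title": "root", "text": "", "children": []}
    stack = [(0, root)]                       # (heading level, node)
    for line in md.splitlines():
        if line.startswith("#"):
            level = len(line) - len(line.lstrip("#"))
            node = {"title": line.lstrip("# ").strip(), "text": "", "children": []}
            while stack[-1][0] >= level:      # pop back up to the parent level
                stack.pop()
            stack[-1][1]["children"].append(node)
            stack.append((level, node))
        else:
            stack[-1][1]["text"] += line + "\n"
    return root
```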

Benchmarks & Skepticism

  • Several commenters are uneasy about the lack of broad, standard RAG benchmark results and worry the showcased FinanceBench gains may rely on weak baselines.
  • Others note that LLM-based indexing and retrieval introduce more tuning knobs (prompts, structure, chunking) and may increase iteration cost vs. embeddings.
  • Overall sentiment: conceptually interesting and likely useful in niches, but needs rigorous large-scale, apples-to-apples evaluations against well-tuned hybrid vector systems.