Show HN: PageIndex – Vectorless RAG
Approach & Intuition
- PageIndex builds a hierarchical tree / table-of-contents over documents, with LLM-generated summaries at each node, then uses an LLM to traverse this tree at query time instead of doing vector similarity search (sketched below).
- Several commenters liken it to B+ trees, source graphs, or Monte Carlo tree search / AlphaGo-style exploration.
- The pitch is “human-like” navigation: simulate how an expert would skim structure, then dive into relevant sections.
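A minimal sketch of that traversal idea, assuming a generic `Node` type and a hypothetical `llm_choose` helper; none of these names reflect PageIndex's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    title: str                      # section heading from the table of contents
    summary: str                    # LLM-generated summary of this subtree
    text: str = ""                  # raw section text (populated on leaves)
    children: list["Node"] = field(default_factory=list)

def retrieve(node: Node, query: str, llm_choose) -> str:
    """Navigate the tree the way a reader skims a table of contents:
    at each level, ask the LLM which child summaries look relevant,
    descend into those, and return the text of the sections reached."""
    if not node.children:
        return node.text
    options = [f"{i}: {c.title} - {c.summary}" for i, c in enumerate(node.children)]
    # llm_choose stands in for an LLM call that returns the indices of the
    # children worth exploring for this query (e.g. via a structured prompt).
    picked = llm_choose(query, options)
    return "\n\n".join(retrieve(node.children[i], query, llm_choose) for i in picked)
```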
Latency, Cost & Scalability
- Many commenters are concerned that this is slower and more expensive than embedding-based retrieval, since both indexing and retrieval involve LLM calls.
- The author clarifies that tree-building is a one-time cost and can be slow, while query-time retrieval uses only the tree (no embedding model) and can be efficient for small trees.
- Skeptics argue it will “scale spectacularly poorly” beyond a few hundred documents or so; proponents counter that hierarchical search is logarithmic and should scale (a back-of-the-envelope sketch follows), but concede that real benchmarks are needed.
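An illustration of the logarithmic-scaling argument; the branching factor and corpus sizes below are made-up numbers, not benchmarks:

```python
import math

def llm_calls_per_query(num_sections: int, branching: int = 10) -> int:
    """One 'which child do I open?' LLM call per level of a balanced tree."""
    return max(1, math.ceil(math.log(num_sections, branching)))

# Roughly 3 calls for 1,000 sections and 6 for 1,000,000; growth is logarithmic,
# but each call still has to fit all sibling summaries into context,
# which is where skeptics expect the approach to strain.
for n in (1_000, 1_000_000):
    print(n, llm_calls_per_query(n))
```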
Comparison with Vector RAG & GraphRAG
- Some say vector DBs are like hash maps, PageIndex like trees: fast approximate lookup vs. structured traversal.
- Multiple comments note that embeddings often return “semantic vibes” rather than truly relevant passages, especially in dense, domain-specific corpora.
- GraphRAG is described as powerful but with extremely expensive, non-linear preprocessing; PageIndex instead shifts cost to per-query dynamic exploration.
- Others argue vectors will remain useful as a first-pass filter, with tree/agentic retrieval doing deeper reasoning.
Accuracy, “Vibe” Retrieval & Reasoning
- There’s disagreement over the claim that this is “less vibe-y” than vectors: critics point out it still depends heavily on LLM judgment and LLM-generated structure, just in a different form.
- Supporters emphasize scenarios where conceptual similarity fails: cross-document inconsistencies, time-scoped questions, or locating exact quotations.
- Several see value in being able to spend more compute for predictably better answers, especially in high-stakes or offline workflows.
Hybrid & Alternative Strategies
- Ideas raised:
  - Use vectors on node summaries to guide tree search (sketched after this list).
  - Invert RAG: generate likely questions or “tiny overview” summaries at ingest time, then use BM25/keyword search plus LLM re-ranking.
  - Let LLMs generate SQL/regex queries instead of vectors.
  - Combine vector ANN for a wide shortlist with LLM or cross-encoder re-ranking.
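A sketch of the first idea, reusing the `Node` type from the earlier example and assuming a hypothetical `embed()` function that returns a NumPy vector; embeddings act as a cheap first-pass filter so the LLM only has to read a shortlist of child summaries:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def shortlist_children(node, query_vec: np.ndarray, embed, top_k: int = 3):
    """First-pass filter: rank child nodes by embedding similarity of their
    summaries, then hand only the top_k to the LLM for the final choice."""
    scored = sorted(
        ((cosine(query_vec, embed(child.summary)), child) for child in node.children),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return [child for _, child in scored[:top_k]]
```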
Use Cases & Limits
- Widely viewed as promising for: single documents or small corpora, complex long reports, offline/background processing, and high-accuracy domains (legal, medical, diagnostics).
- Seen as a poor fit for: massive multi-million-document corpora, and real-time chat-style RAG where low latency and cheap inference dominate.
Ingestion, OCR & Structure Extraction
- A side thread notes that PageIndex depends heavily on high-quality document structure (headings, sections), making PDF/HTML-to-markdown conversion and OCR quality critical (see the sketch after this list).
- Various tools and pipelines are mentioned; some highlight the need for “document layout analysis” rather than simple OCR.
- OCR and HTML support associated with PageIndex are mentioned, but commenters request formal benchmarks.
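To make the dependence on structure concrete, a toy sketch that builds an outline tree from markdown headings; the function name and dict shape are illustrative, and real pipelines would need proper layout analysis for PDFs, as the thread points out:

```python
import re

def headings_to_tree(markdown: str) -> dict:
    """Turn '# Title' / '## Section' lines into a nested outline.
    OCR that drops or garbles heading markers collapses the whole tree,
    which is why commenters stress layout analysis over plain OCR."""
    root = {"title": "ROOT", "level": 0, "children": []}
    stack = [root]
    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.+)", line)
        if not match:
            continue
        level, title = len(match.group(1)), match.group(2).strip()
        while stack[-1]["level"] >= level:   # climb back up to the parent level
            stack.pop()
        node = {"title": title, "level": level, "children": []}
        stack[-1]["children"].append(node)
        stack.append(node)
    return root
```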
Benchmarks & Skepticism
- Several commenters are uneasy about the lack of broad, standard RAG benchmark results and worry the showcased FinanceBench gains may rely on weak baselines.
- Others note that LLM-based indexing and retrieval introduce more tuning knobs (prompts, structure, chunking) and may increase iteration cost vs. embeddings.
- Overall sentiment: conceptually interesting and likely useful in niches, but needs rigorous large-scale, apples-to-apples evaluations against well-tuned hybrid vector systems.