So you wanna build a local RAG?
Semantic chunking and document structure
- Several comments stress that embedding whole documents hurts retrieval quality; semantic chunking, plus added context about each chunk’s role in the document, can dramatically improve results.
- Anthropic-style “contextual retrieval” (generate a short summary of each chunk’s place in the document and prepend it before embedding) is cited as particularly effective; a sketch of the idea follows this list.
- Some wonder whether GraphRAG / knowledge-graph approaches could better capture cross-document structure, the way personal knowledge-management tools do.
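A minimal sketch of that contextual-retrieval idea, assuming an OpenAI-compatible local endpoint (llama.cpp’s llama-server and Ollama both expose one); the URL, model name, and prompt wording here are illustrative, not Anthropic’s exact recipe:

```python
# Contextual retrieval sketch: before embedding each chunk, ask an LLM to
# describe the chunk's role within the full document and prepend that context.
from openai import OpenAI

# Placeholder endpoint/model: point this at whatever local server you run.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

CONTEXT_PROMPT = """<document>
{document}
</document>
Here is a chunk from the document above:
<chunk>
{chunk}
</chunk>
Write one short sentence situating this chunk within the overall document,
to improve retrieval of the chunk. Answer with only that sentence."""

def contextualize(document: str, chunk: str) -> str:
    resp = client.chat.completions.create(
        model="local-model",  # placeholder name
        messages=[{"role": "user",
                   "content": CONTEXT_PROMPT.format(document=document, chunk=chunk)}],
        max_tokens=100,
    )
    context = resp.choices[0].message.content.strip()
    # Embed and index the generated context together with the raw chunk text.
    return f"{context}\n\n{chunk}"
```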
Lexical vs. semantic (vector) search debate
- One camp argues you can skip vector DBs entirely: full-text search (BM25, SQLite FTS, grep/rg, TypeSense, Elasticsearch) plus an LLM-driven query loop often works “well enough,” is cheaper and simpler, and sidesteps chunking issues.
- Others report that pure lexical search degrades recall, especially when users don’t know the exact terminology, and that the extra search iterations needed to compensate inflate latency.
- A common framing: lexical search gives high precision / lower recall; semantic search gives higher recall / lower precision.
- Many advocate hybrid systems (BM25 + embeddings, score fusion, reranking) as the current best practice, though some question whether the added engineering complexity pays off; see the fusion sketch after this list.
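A minimal hybrid sketch, assuming the rank_bm25 and sentence-transformers packages; the corpus, model name, and query are placeholders. bm25_ranking alone is the “skip the vector DB” baseline from above; rrf fuses it with a dense ranking via Reciprocal Rank Fusion:

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import numpy as np

docs = [
    "Reset your password from the account settings page.",
    "API keys can be rotated in the developer console.",
    "The proxy is configured via environment variables.",
]
bm25 = BM25Okapi([d.lower().split() for d in docs])
model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
doc_vecs = model.encode(docs, normalize_embeddings=True)

def bm25_ranking(query: str, k: int = 10) -> list:
    scores = bm25.get_scores(query.lower().split())
    return list(np.argsort(scores)[::-1][:k])

def dense_ranking(query: str, k: int = 10) -> list:
    q = model.encode([query], normalize_embeddings=True)[0]
    return list(np.argsort(doc_vecs @ q)[::-1][:k])  # cosine; vectors are normalized

def rrf(rankings: list, k: int = 60) -> list:
    # Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank(d))
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.__getitem__, reverse=True)

print(rrf([bm25_ranking("change my login credentials"),
           dense_ranking("change my login credentials")]))
```

RRF only needs ranks, not score scales that are comparable across systems, which is why it is a common default for fusing lexical and dense results.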
Evaluation and real-world usage
- Strong emphasis on proper evals: build test sets of (query, correct answer) pairs, generate synthetic Q&A, and compare BM25 vs. vector vs. hybrid configurations; a minimal harness is sketched after this list.
- People note that dev-created queries are biased (devs “know the docs”); real users phrase things differently, and evaluating on their queries reveals much poorer performance.
- Some propose automated pipelines and LLM judges to continuously score RAG changes.
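A minimal eval harness in that vein, reusing the bm25_ranking / dense_ranking / rrf functions sketched above; the (query, doc id) pairs are illustrative:

```python
# Report recall@k per retriever so BM25-only, dense-only, and hybrid
# configurations can be compared on the same test set.
def recall_at_k(retrieve, test_set, k: int = 3) -> float:
    hits = sum(gold in retrieve(query)[:k] for query, gold in test_set)
    return hits / len(test_set)

# Illustrative pairs; real test sets should come from actual user phrasings,
# not from devs who already know the docs.
test_set = [("how do I rotate my API keys", 1), ("password reset flow", 0)]

retrievers = {
    "bm25": bm25_ranking,
    "dense": dense_ranking,
    "hybrid": lambda q: rrf([bm25_ranking(q), dense_ranking(q)]),
}
for name, retrieve in retrievers.items():
    print(f"{name}: recall@3 = {recall_at_k(retrieve, test_set):.2f}")
```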
Local models and infra
- One view: running a local LLM is overkill; keeping just the documents and the vector DB local is already a big win.
- Others say consumer hardware (16–32GB GPUs or high-RAM laptops) can run substantial local models, and that medium-sized orgs can self-host if they value privacy; a minimal generation sketch follows this list.
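A generation-side sketch with llama-cpp-python, assuming a quantized GGUF model; the path is a placeholder, and n_gpu_layers=-1 offloads all layers to the GPU:

```python
from llama_cpp import Llama

llm = Llama(model_path="./models/some-model-q4_k_m.gguf",  # placeholder path
            n_ctx=8192, n_gpu_layers=-1)

def answer(question: str, retrieved_chunks: list) -> str:
    # Stuff the retrieved chunks into the prompt and ask the local model.
    context = "\n\n".join(retrieved_chunks)
    resp = llm.create_chat_completion(
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
        max_tokens=512,
    )
    return resp["choices"][0]["message"]["content"]
```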
Practical challenges and tooling
- Document parsing (especially PDFs with tables, images, multi-page layouts, and footnotes) is described as a major unsolved pain point, often more limiting than the choice of retrieval method; see the extraction sketch after this list.
- Various tools/stacks are mentioned: llama.cpp-based apps, Elasticsearch, Chroma, sqlite-vec, local RAG GUIs, Nextcloud + MCP, and open-source RAG frameworks with benchmarks.
- Some highlight language issues: many embedding models are English-only, so multilingual or language-specific models and leaderboards are needed for non-English RAG (a one-line model swap is shown after this list).
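As a baseline for the parsing pain point, a plain-text extraction sketch with PyMuPDF (imported as fitz); note that get_text() flattens tables, drops image content, and interleaves footnotes, which is exactly where PDFs break down:

```python
import fitz  # PyMuPDF

def extract_pages(path: str) -> list:
    # Naive per-page text dump; tables and multi-column layouts get mangled.
    doc = fitz.open(path)
    pages = [page.get_text() for page in doc]
    doc.close()
    return pages
```

And on the language point, swapping in a multilingual embedding model is often a one-line change with sentence-transformers (paraphrase-multilingual-MiniLM-L12-v2 is one such model):

```python
from sentence_transformers import SentenceTransformer

# Multilingual model so queries and documents in different languages
# land in the same vector space.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
vecs = model.encode(["Wie konfiguriere ich den Proxy?",
                     "How do I configure the proxy?"],
                    normalize_embeddings=True)
```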