So you wanna build a local RAG?

Semantic chunking and document structure

  • Several comments stress that embedding whole documents hurts retrieval quality; semantic chunking, plus added context about each chunk’s role in the document, can dramatically improve results.
  • Anthropic-style “contextual retrieval” (generating short summaries/metadata around each chunk) is cited as particularly effective; a sketch follows this list.
  • Some wonder if GraphRAG / knowledge-graph approaches could better capture cross-document structure, similar to personal knowledge tools.
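
A hedged sketch of the chunk-context idea. The prompt shape loosely follows the contextual-retrieval recipe Anthropic published; `call_llm` and the downstream embedding step are stand-ins for whatever local model client you actually use:

```python
# Prepend an LLM-written "where does this chunk fit" note to each chunk
# before embedding it. call_llm is an assumed callable: prompt str -> str.

CONTEXT_PROMPT = """\
<document>
{document}
</document>

Here is a chunk from the document above:
<chunk>
{chunk}
</chunk>

Write one short sentence situating this chunk within the overall document,
to improve search retrieval of the chunk. Answer with only that sentence."""

def contextualize_chunks(document: str, chunks: list[str], call_llm) -> list[str]:
    """Return chunks with a generated context line prepended."""
    enriched = []
    for chunk in chunks:
        context = call_llm(CONTEXT_PROMPT.format(document=document, chunk=chunk))
        # Embed this enriched text instead of the raw chunk.
        enriched.append(f"{context}\n\n{chunk}")
    return enriched
```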

Lexical vs semantic search (vectors) debate

  • One camp argues you can skip vector DBs entirely: full-text search (BM25, SQLite FTS, grep/rg, Typesense, Elasticsearch) plus an LLM-driven query loop often works “well enough,” is cheaper and simpler, and sidesteps chunking issues.
  • Others report that pure lexical search degrades recall, especially when users don’t know the documents’ exact terminology, and that the extra search iterations needed to compensate inflate latency.
  • A common framing: lexical search gives high precision / lower recall; semantic search gives higher recall / lower precision.
  • Many advocate hybrid systems (BM25 + embeddings, rank fusion, reranking) as the current best practice, though some question whether the added engineering complexity pays off; a minimal fusion sketch follows this list.
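
One minimal sketch of such a hybrid, assuming the rank_bm25 package for the lexical side and an `embed` function that returns unit-normalized vectors (both assumptions, not anything the thread prescribes); the two rankings are merged with reciprocal rank fusion:

```python
import numpy as np
from rank_bm25 import BM25Okapi

def rrf(rankings: list[list[int]], k: int = 60) -> list[int]:
    """Reciprocal rank fusion: merge ranked lists of doc ids into one list."""
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def hybrid_search(query: str, docs: list[str], doc_vecs: np.ndarray,
                  embed, top_n: int = 10) -> list[int]:
    # Lexical ranking (a real system would build the BM25 index once, not per query).
    bm25 = BM25Okapi([d.lower().split() for d in docs])
    lex_rank = list(np.argsort(-bm25.get_scores(query.lower().split())))
    # Semantic ranking: cosine similarity via dot product of unit vectors.
    sem_rank = list(np.argsort(-(doc_vecs @ embed(query))))
    return rrf([lex_rank, sem_rank])[:top_n]
```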

Evaluation and real-world usage

  • Strong emphasis on proper evals: build test sets of [query, correct answer] pairs, supplement them with synthetic Q&A, and compare BM25 vs vector vs hybrid configurations (a bare-bones scoring harness follows this list).
  • People note that dev-created queries are biased (devs “know the docs”); real users phrase things differently, and testing on their queries typically reveals much poorer performance.
  • Some propose automated pipelines and LLM judges to continuously score RAG changes.
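
A bare-bones version of such a harness, sketched under the assumption that each retriever is a callable mapping a query to ranked chunk ids (all names below are hypothetical):

```python
def recall_at_k(retriever, test_set: list[tuple[str, int]], k: int = 5) -> float:
    """Fraction of queries whose labeled chunk appears in the top-k results."""
    hits = sum(1 for query, gold_id in test_set if gold_id in retriever(query)[:k])
    return hits / len(test_set)

# Hypothetical usage comparing the configurations debated above; the test set
# should ideally come from real user queries, not dev-written ones.
# test_set = [("how do I rotate API keys?", 42), ...]
# for name, fn in {"bm25": bm25_search, "vector": vec_search, "hybrid": hybrid_search}.items():
#     print(name, recall_at_k(fn, test_set))
```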

Local models and infra

  • One view: running a local LLM is overkill; keeping only docs and vector DB local is already a big win.
  • Others say consumer hardware (16–32GB GPUs or high-RAM laptops) can run substantial local models, and that medium-sized orgs can self-host the whole stack if they value privacy (a sketch of the fully local option follows this list).
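
For the fully local route the moving parts can stay small: llama.cpp’s llama-server exposes an OpenAI-compatible HTTP API, so the generation step of a local RAG is a single POST. A sketch, assuming a server is already running on the default port:

```python
#   llama-server -m some-model.gguf --port 8080   (start the server first)
import requests

def answer_locally(question: str, retrieved_chunks: list[str]) -> str:
    context = "\n\n".join(retrieved_chunks)
    resp = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={
            # llama-server serves a single model, so the "model" field can be omitted.
            "messages": [
                {"role": "system", "content": "Answer using only the provided context."},
                {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
            ],
        },
        timeout=120,
    )
    return resp.json()["choices"][0]["message"]["content"]
```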

Practical challenges and tooling

  • Document parsing (especially PDFs with tables, images, multi-page layouts, and footnotes) is described as a major unsolved pain point, often more limiting than the choice of retrieval method.
  • Various tools/stacks are mentioned: llama.cpp-based apps, Elasticsearch, Chroma, sqlite-vec, local RAG GUIs, Nextcloud + MCP, and open-source RAG frameworks with benchmarks; a sqlite-vec sketch follows this list.
  • Some highlight language issues: many embedding models are English-only; multilingual or language-specific models and leaderboards are needed for non-English RAG.
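
To make the lighter end of that tooling list concrete, here is a sketch using sqlite-vec, where the entire vector store is one SQLite file. The 384-dim embedding size and the schema are illustrative assumptions:

```python
import sqlite3
import sqlite_vec
from sqlite_vec import serialize_float32

db = sqlite3.connect("rag.db")
db.enable_load_extension(True)
sqlite_vec.load(db)
db.enable_load_extension(False)

# vec0 virtual table holds the vectors; a plain table holds the chunk text.
db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS chunks USING vec0(embedding float[384])")
db.execute("CREATE TABLE IF NOT EXISTS chunk_text(id INTEGER PRIMARY KEY, body TEXT)")

def add_chunk(chunk_id: int, text: str, vector: list[float]) -> None:
    db.execute("INSERT INTO chunk_text(id, body) VALUES (?, ?)", (chunk_id, text))
    db.execute("INSERT INTO chunks(rowid, embedding) VALUES (?, ?)",
               (chunk_id, serialize_float32(vector)))

def nearest(query_vector: list[float], k: int = 5) -> list[tuple[int, float]]:
    # KNN query form from the sqlite-vec docs: MATCH + ORDER BY distance.
    return db.execute(
        "SELECT rowid, distance FROM chunks WHERE embedding MATCH ? "
        "ORDER BY distance LIMIT ?",
        (serialize_float32(query_vector), k),
    ).fetchall()
```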