Production RAG: what I learned from processing 5M+ documents

Chunking and Document Processing

  • Many commenters agree chunking is a major pain point and the main source of effort in production RAG.
  • Some use LLMs (e.g., Anthropic-style contextual retrieval) to summarize large texts and derive semantically meaningful chunks, including per-chunk summaries embedded alongside the raw text (a minimal sketch follows this list).
  • Several people note the public repo for the article’s product doesn’t actually expose the real chunker, only chunk data models; there’s curiosity about the concrete strategies used.
  • There’s interest in more detail on what “processing 5M docs” actually entailed and how chunking differed by use case.
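
As a rough illustration of the contextual-retrieval approach mentioned above, the sketch below asks a model to summarize each chunk in the context of its full document and embeds that summary together with the raw text. It assumes the OpenAI Python SDK; the model names, prompt wording, and 8,000-character document cap are illustrative placeholders, not details from the article.

```python
# Sketch of LLM-assisted "contextual" chunking: for each raw chunk, ask a model
# for a short summary that situates it in the document, then embed summary +
# raw text together so retrieval sees both.
from dataclasses import dataclass
from openai import OpenAI

client = OpenAI()

@dataclass
class Chunk:
    text: str               # raw chunk text
    context: str            # LLM-written summary of where the chunk fits
    embedding: list[float]  # vector over context + raw text

def contextualize(document: str, raw_chunks: list[str]) -> list[Chunk]:
    out = []
    for raw in raw_chunks:
        prompt = (
            f"Document:\n{document[:8000]}\n\n"   # cap is arbitrary, for illustration
            f"Chunk:\n{raw}\n\n"
            "In 1-2 sentences, explain what this chunk is about and where it "
            "fits in the document, so it can be retrieved on its own."
        )
        context = client.chat.completions.create(
            model="gpt-4.1-mini",                 # placeholder model choice
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content

        embedding = client.embeddings.create(
            model="text-embedding-3-small",       # placeholder embedding model
            input=f"{context}\n\n{raw}",
        ).data[0].embedding

        out.append(Chunk(text=raw, context=context, embedding=embedding))
    return out
```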

Reranking vs Plain Embedding Search

  • Rerankers are repeatedly called out as a high-leverage addition: small, finetuned models that reorder top-k vector hits by relevance to the query.
  • They’re described as “what you wanted cross-encoders to be”: more accurate than cosine similarity alone but cheaper and faster than an extra full LLM call.
  • Explanations emphasize: embeddings measure “looks like the question,” rerankers measure “looks like an answer.”
  • Typical pattern: vector search → top N (e.g., 50) → reranker → top M (e.g., 15); a sketch follows below the list. Some suggest also letting a general LLM rerank when latency and cost allow.
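
A minimal sketch of that two-stage pattern, using a public cross-encoder checkpoint from sentence-transformers as the reranker; `vector_search()` stands in for whatever store is actually used, and the N/M defaults simply mirror the numbers above.

```python
# Sketch of the common two-stage pattern: broad, cheap dense recall, then a
# small cross-encoder reranker reorders candidates by how answer-like they are.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # public checkpoint

def retrieve(query: str, vector_search, n: int = 50, m: int = 15) -> list[str]:
    # Stage 1: high-recall vector search, top N candidates (e.g., 50).
    candidates = vector_search(query, top_k=n)   # placeholder: returns chunk texts

    # Stage 2: score each (query, chunk) pair with the cross-encoder
    # and keep only the top M (e.g., 15) for the generation step.
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [doc for doc, _ in ranked[:m]]
```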

Query Generation, Hybrid & Agentic Retrieval

  • Synthetic query generation/expansion is widely endorsed for fixing poor user queries; some generate multiple variants, search in parallel, and fuse results (e.g., reciprocal rank fusion; see the fusion sketch below).
  • Best-practice stacks often combine dense vectors + BM25 and a reranker; embeddings alone are seen as inadequate, especially for technical terms.
  • Several comments advocate “agentic RAG”: giving the LLM search tools, letting it reformulate queries, do multiple rounds of search, and mix different tools and indices (sketched in the second example below).
  • There’s disagreement on how reliably current LLMs use tools and on latency tradeoffs; some systems are async and accept slower, deeper research.
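
One way the expansion-plus-fusion idea from the first bullet (and the dense + BM25 hybrid from the second) might be wired together is sketched below, using reciprocal rank fusion; `generate_variants`, `search_dense`, and `search_bm25` are placeholders for your own LLM and indices.

```python
# Sketch: expand the query into variants, search each variant against both a
# dense index and BM25, then fuse all ranked lists with reciprocal rank fusion.
from collections import defaultdict

def rrf(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def hybrid_search(query: str, top_k: int = 50) -> list[str]:
    variants = [query] + generate_variants(query)    # placeholder: LLM rewrites
    ranked_lists = []
    for q in variants:
        ranked_lists.append(search_dense(q, top_k))  # placeholder: vector index
        ranked_lists.append(search_bm25(q, top_k))   # placeholder: keyword index
    return rrf(ranked_lists)[:top_k]
```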
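
And a compact sketch of the agentic loop: the model may keep reformulating and re-searching (here reusing `hybrid_search` from the previous sketch) until it decides it has enough evidence. `llm_plan` is a hypothetical stand-in for a real tool-calling LLM client, not an actual API.

```python
# Sketch of agentic RAG: retrieve, let the model judge sufficiency, and either
# answer or reformulate the query for another round of search.
def agentic_answer(question: str, max_rounds: int = 3) -> str:
    evidence: list[str] = []
    query = question
    for _ in range(max_rounds):
        evidence += hybrid_search(query, top_k=15)
        # Placeholder call: the model returns either an answer or a new query.
        decision = llm_plan(question=question, evidence=evidence)
        if decision["action"] == "answer":
            return decision["answer"]
        query = decision["reformulated_query"]       # try again with a better query
    # Out of rounds: force an answer from whatever evidence was gathered.
    return llm_plan(question=question, evidence=evidence, force_answer=True)["answer"]
```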

Embedding Models and Vector Stores

  • Multiple commenters are surprised the article didn’t explore more embedding models, noting newer open and commercial models often outperform OpenAI’s.
  • Alternatives mentioned include Qwen3 embeddings, Gemini embeddings, Voyage, mixedbread and models ranked on newer leaderboards.
  • Vector store choice is debated: S3 Vectors is praised for simplicity and cost but critiqued for higher latency and lack of sparse/keyword support; others stress picking stores that support metadata filtering and hybrid search.

UX, Evaluation, and Deployment Concerns

  • Practitioners emphasize search-oriented UIs that make context and controls visible, rather than opaque chat, so user expectations stay aligned with what the system does.
  • Metadata “injection” (titles, authors, timestamps, versions) alongside chunks is seen as important for filtering and grounding (illustrated after this list).
  • Some ask how systems are evaluated (frameworks vs custom metrics) and whether performance yields real process-efficiency gains.
  • There’s debate over what “self-hosted” means when many “self-hosted” stacks still require multiple third-party cloud services.
  • One notable operational finding: GPT‑5 reportedly underperformed GPT‑4.1 in this RAG setting with large contexts (worse instruction following, overly long answers, tighter context window), leading the author back to GPT‑4.1.
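
A minimal sketch of the metadata-injection idea from the second bullet, assuming generic `embed()` and `upsert()` helpers rather than any particular vector store: the same fields are prepended to the embedded text and stored as filterable metadata.

```python
# Sketch: store title/author/timestamp/version as filterable fields and also
# prepend a short header to the text that gets embedded, so the metadata helps
# both filtering and grounding. embed() and upsert() are placeholders.
def index_chunk(chunk_text: str, doc_meta: dict) -> None:
    header = (
        f"Title: {doc_meta['title']}\n"
        f"Author: {doc_meta['author']}\n"
        f"Updated: {doc_meta['updated_at']} (v{doc_meta['version']})\n\n"
    )
    vector = embed(header + chunk_text)      # metadata-aware embedding
    upsert(
        vector=vector,
        text=chunk_text,
        metadata=doc_meta,                   # enables metadata filtering at query time
    )
```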