Production RAG: what I learned from processing 5M+ documents

Chunking and Document Processing

  • Many commenters agree chunking is a major pain point and the main source of effort in production RAG.
  • Some use LLMs (e.g., Anthropic-style contextual retrieval) to summarize large texts and derive semantically meaningful chunks, including per-chunk summaries embedded alongside the raw text (a minimal sketch follows this list).
  • Several people note the public repo for the article’s product doesn’t actually expose the real chunker, only chunk data models; there’s curiosity about the concrete strategies used.
  • There’s interest in more detail on what “processing 5M docs” actually entailed and how chunking differed by use case.
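
As a rough illustration of the contextual-retrieval approach mentioned above, the sketch below asks a model to summarize each chunk in the context of its full document and embeds that summary together with the raw text. It assumes the OpenAI Python SDK; the model names, prompt wording, and 8,000-character document cap are illustrative placeholders, not details from the article.

```python
# Sketch of LLM-assisted "contextual" chunking: for each raw chunk, ask a model
# for a short summary that situates it in the document, then embed summary +
# raw text together so retrieval sees both.
from dataclasses import dataclass
from openai import OpenAI

client = OpenAI()

@dataclass
class Chunk:
    text: str               # raw chunk text
    context: str            # LLM-written summary of where the chunk fits
    embedding: list[float]  # vector over context + raw text

def contextualize(document: str, raw_chunks: list[str]) -> list[Chunk]:
    out = []
    for raw in raw_chunks:
        prompt = (
            f"Document:\n{document[:8000]}\n\n"   # cap is arbitrary, for illustration
            f"Chunk:\n{raw}\n\n"
            "In 1-2 sentences, explain what this chunk is about and where it "
            "fits in the document, so it can be retrieved on its own."
        )
        context = client.chat.completions.create(
            model="gpt-4.1-mini",                 # placeholder model choice
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content

        embedding = client.embeddings.create(
            model="text-embedding-3-small",       # placeholder embedding model
            input=f"{context}\n\n{raw}",
        ).data[0].embedding

        out.append(Chunk(text=raw, context=context, embedding=embedding))
    return out
```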

Reranking vs Plain Embedding Search

  • Rerankers are repeatedly called out as a high-leverage addition: small, finetuned models that reorder top-k vector hits by relevance to the query.
  • They’re described as “what you wanted cross-encoders to be”: more accurate than cosine similarity alone but cheaper and faster than an extra full LLM call.
  • Explanations emphasize: embeddings measure “looks like the question,” rerankers measure “looks like an answer.”
  • Typical pattern: vector search → top N (e.g., 50) → reranker → top M (e.g., 15); a sketch follows below the list. Some suggest also letting a general LLM rerank when latency and cost allow.
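
A minimal sketch of that two-stage pattern, using a public cross-encoder checkpoint from sentence-transformers as the reranker; `vector_search()` stands in for whatever store is actually used, and the N/M defaults simply mirror the numbers above.

```python
# Sketch of the common two-stage pattern: broad, cheap dense recall, then a
# small cross-encoder reranker reorders candidates by how answer-like they are.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # public checkpoint

def retrieve(query: str, vector_search, n: int = 50, m: int = 15) -> list[str]:
    # Stage 1: high-recall vector search, top N candidates (e.g., 50).
    candidates = vector_search(query, top_k=n)   # placeholder: returns chunk texts

    # Stage 2: score each (query, chunk) pair with the cross-encoder
    # and keep only the top M (e.g., 15) for the generation step.
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [doc for doc, _ in ranked[:m]]
```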

Query Generation, Hybrid & Agentic Retrieval

  • Synthetic query generation/expansion is widely endorsed for fixing poor user queries; some generate multiple variants, search in parallel, and fuse results (e.g., reciprocal rank fusion; see the fusion sketch below).
  • Best-practice stacks often combine dense vectors + BM25 and a reranker; embeddings alone are seen as inadequate, especially for technical terms.
  • Several comments advocate “agentic RAG”: giving the LLM search tools, letting it reformulate queries, do multiple rounds of search, and mix different tools and indices (sketched in the second example below).
  • There’s disagreement on how reliably current LLMs use tools and on latency tradeoffs; some systems are async and accept slower, deeper research.
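
One way the expansion-plus-fusion idea from the first bullet (and the dense + BM25 hybrid from the second) might be wired together is sketched below, using reciprocal rank fusion; `generate_variants`, `search_dense`, and `search_bm25` are placeholders for your own LLM and indices.

```python
# Sketch: expand the query into variants, search each variant against both a
# dense index and BM25, then fuse all ranked lists with reciprocal rank fusion.
from collections import defaultdict

def rrf(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def hybrid_search(query: str, top_k: int = 50) -> list[str]:
    variants = [query] + generate_variants(query)    # placeholder: LLM rewrites
    ranked_lists = []
    for q in variants:
        ranked_lists.append(search_dense(q, top_k))  # placeholder: vector index
        ranked_lists.append(search_bm25(q, top_k))   # placeholder: keyword index
    return rrf(ranked_lists)[:top_k]
```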
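
And a compact sketch of the agentic loop: the model may keep reformulating and re-searching (here reusing `hybrid_search` from the previous sketch) until it decides it has enough evidence. `llm_plan` is a hypothetical stand-in for a real tool-calling LLM client, not an actual API.

```python
# Sketch of agentic RAG: retrieve, let the model judge sufficiency, and either
# answer or reformulate the query for another round of search.
def agentic_answer(question: str, max_rounds: int = 3) -> str:
    evidence: list[str] = []
    query = question
    for _ in range(max_rounds):
        evidence += hybrid_search(query, top_k=15)
        # Placeholder call: the model returns either an answer or a new query.
        decision = llm_plan(question=question, evidence=evidence)
        if decision["action"] == "answer":
            return decision["answer"]
        query = decision["reformulated_query"]       # try again with a better query
    # Out of rounds: force an answer from whatever evidence was gathered.
    return llm_plan(question=question, evidence=evidence, force_answer=True)["answer"]
```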

Embedding Models and Vector Stores

  • Multiple commenters are surprised the article didn’t explore more embedding models, noting newer open and commercial models often outperform OpenAI’s.
  • Alternatives mentioned include Qwen3 embeddings, Gemini embeddings, Voyage, mixedbread and models ranked on newer leaderboards.
  • Vector store choice is debated: S3 Vectors is praised for simplicity and cost but critiqued for higher latency and lack of sparse/keyword support; others stress picking stores that support metadata filtering and hybrid search.

UX, Evaluation, and Deployment Concerns

  • Practitioners emphasize search-oriented UIs that make context and controls visible, rather than opaque chat, so user expectations stay aligned with what the system does.
  • Metadata “injection” (titles, authors, timestamps, versions) alongside chunks is seen as important for filtering and grounding (illustrated after this list).
  • Some ask how systems are evaluated (frameworks vs custom metrics) and whether performance yields real process-efficiency gains.
  • There’s debate over what “self-hosted” means when many “self-hosted” stacks still require multiple third-party cloud services.
  • One notable operational finding: GPT‑5 reportedly underperformed GPT‑4.1 in this RAG setting with large contexts (worse instruction following, overly long answers, tighter context window), leading the author back to GPT‑4.1.
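
A minimal sketch of the metadata-injection idea from the second bullet, assuming generic `embed()` and `upsert()` helpers rather than any particular vector store: the same fields are prepended to the embedded text and stored as filterable metadata.

```python
# Sketch: store title/author/timestamp/version as filterable fields and also
# prepend a short header to the text that gets embedded, so the metadata helps
# both filtering and grounding. embed() and upsert() are placeholders.
def index_chunk(chunk_text: str, doc_meta: dict) -> None:
    header = (
        f"Title: {doc_meta['title']}\n"
        f"Author: {doc_meta['author']}\n"
        f"Updated: {doc_meta['updated_at']} (v{doc_meta['version']})\n\n"
    )
    vector = embed(header + chunk_text)      # metadata-aware embedding
    upsert(
        vector=vector,
        text=chunk_text,
        metadata=doc_meta,                   # enables metadata filtering at query time
    )
```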