Production RAG: what I learned from processing 5M+ documents
Chunking and Document Processing
- Many commenters agree chunking is a major pain point and the main source of effort in production RAG.
- Some use LLMs (e.g., Anthropic-style contextual retrieval) to summarize large texts and derive semantically meaningful chunks, including per-chunk summaries embedded alongside the raw text (see the sketch after this list).
- Several people note the public repo for the article’s product doesn’t actually expose the real chunker, only chunk data models; there’s curiosity about the concrete strategies used.
- There’s interest in more detail on what “processing 5M docs” actually entailed and how chunking differed by use case.
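
For illustration, a minimal sketch of the contextual-retrieval idea referenced above: ask an LLM to write a short, document-aware context for each chunk and embed that context together with the raw text. The prompt, model names, and record layout are illustrative assumptions, not the article's actual pipeline, and very long documents would need trimming to fit the prompt budget.

```python
# Sketch of Anthropic-style contextual retrieval: each chunk is embedded together
# with a short LLM-written context that situates it within the source document.
# Model names, the prompt, and the record layout are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def contextualize_chunk(document: str, chunk: str) -> str:
    """Ask an LLM for a 1-2 sentence context that situates the chunk in the document."""
    prompt = (
        "Here is a document:\n<document>\n" + document + "\n</document>\n\n"
        "Here is a chunk from that document:\n<chunk>\n" + chunk + "\n</chunk>\n\n"
        "Write 1-2 sentences situating this chunk within the document, to improve "
        "search retrieval of the chunk. Return only the context, nothing else."
    )
    resp = client.chat.completions.create(
        model="gpt-4.1-mini",  # illustrative: any cheap, capable model works here
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

def build_records(document: str, chunks: list[str]) -> list[dict]:
    """Embed 'context + raw chunk' so each vector carries document-level meaning."""
    records = []
    for chunk in chunks:
        context = contextualize_chunk(document, chunk)
        emb = client.embeddings.create(
            model="text-embedding-3-small",  # illustrative embedding model
            input=context + "\n\n" + chunk,
        ).data[0].embedding
        records.append({"chunk": chunk, "context": context, "embedding": emb})
    return records
```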
Reranking vs Plain Embedding Search
- Rerankers are repeatedly called out as a high-leverage addition: small, fine-tuned models that reorder the top-k vector hits by relevance to the query.
- They’re described as “what you wanted cross-encoders to be”: more accurate than cosine similarity alone but cheaper and faster than an extra full LLM call.
- Explanations emphasize: embeddings measure “looks like the question,” rerankers measure “looks like an answer.”
- Typical pattern: vector search → top N (e.g., 50) → reranker → top M (e.g., 15); see the sketch below. Some suggest also letting a general-purpose LLM rerank when latency and cost budgets allow.
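
A minimal sketch of that two-stage pattern, assuming an off-the-shelf cross-encoder from sentence-transformers as the reranker; the model name is an illustrative choice, and the 50/15 cutoffs are just the examples from the thread.

```python
# Sketch of the two-stage pattern: cheap vector search for recall, then a small
# cross-encoder reranker for precision. The model name is illustrative.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_m: int = 15) -> list[str]:
    """Re-order vector-search hits by query-passage relevance and keep the top M."""
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_m]]

# Usage (the retrieval call is a hypothetical placeholder for your vector store):
# top_50 = vector_store.search(query, k=50)
# context_docs = rerank(query, top_50, top_m=15)
```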
Query Generation, Hybrid & Agentic Retrieval
- Synthetic query generation/expansion is widely endorsed for fixing poor user queries; some generate multiple variants, search in parallel, and fuse the results with reciprocal rank fusion (sketched after this list).
- Best-practice stacks often combine dense vectors + BM25 and a reranker; embeddings alone are seen as inadequate, especially for technical terms.
- Several comments advocate “agentic RAG”: giving the LLM search tools and letting it reformulate queries, run multiple rounds of search, and mix different tools and indices (a minimal loop is sketched after this list).
- There’s disagreement about how reliably current LLMs use tools and about the latency tradeoffs; some systems run asynchronously and accept slower, deeper research.
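
To make the fusion step concrete, a minimal sketch of reciprocal rank fusion over ranked lists produced by several query variants and indices. Where the lists come from (a dense index, BM25, an LLM that rewrites the query) is left abstract; k=60 is the conventional RRF constant and the document IDs are made up.

```python
# Sketch of reciprocal rank fusion (RRF): fuse ranked result lists produced by
# several query variants and/or several indices (dense, BM25) into one ranking.
from collections import defaultdict

def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """score(d) = sum over lists of 1 / (k + rank of d in that list)."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Each list would come from one (query variant, index) pair, e.g. an LLM-rewritten
# query against the dense index or the original query against BM25.
dense_variant_1 = ["doc_7", "doc_2", "doc_9"]
dense_variant_2 = ["doc_2", "doc_4", "doc_7"]
bm25_original   = ["doc_2", "doc_7", "doc_1"]
print(rrf_fuse([dense_variant_1, dense_variant_2, bm25_original])[:3])
# doc_2 and doc_7 rise to the top because they rank well across several lists.
```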
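And a hedged sketch of what an agentic retrieval loop can look like, assuming the OpenAI Chat Completions tool-calling API; the `search_index` tool and its schema are hypothetical stand-ins for whatever indices a real system exposes.

```python
# Sketch of an agentic retrieval loop: the model decides when to call a search
# tool, can reformulate the query, and can run several rounds before answering.
# search_index() and the tool schema are hypothetical stand-ins for real indices.
import json
from openai import OpenAI

client = OpenAI()

def search_index(query: str) -> list[str]:
    """Hypothetical retrieval call (dense, BM25, or hybrid) returning text snippets."""
    return [f"(snippet matching: {query})"]

TOOLS = [{
    "type": "function",
    "function": {
        "name": "search_index",
        "description": "Search the document index. Call again with reformulated queries if needed.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

def agentic_answer(question: str, max_rounds: int = 4) -> str:
    messages = [{"role": "user", "content": question}]
    for _ in range(max_rounds):
        resp = client.chat.completions.create(model="gpt-4.1", messages=messages, tools=TOOLS)
        msg = resp.choices[0].message
        if not msg.tool_calls:          # no more searches requested: this is the answer
            return msg.content
        messages.append(msg)            # keep the assistant turn that asked for tools
        for call in msg.tool_calls:
            query = json.loads(call.function.arguments)["query"]
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": json.dumps(search_index(query)),
            })
    return "(gave up after max_rounds of searching)"
```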
Embedding Models and Vector Stores
- Multiple commenters are surprised the article didn’t explore more embedding models, noting newer open and commercial models often outperform OpenAI’s.
- Alternatives mentioned include Qwen3 embeddings, Gemini embeddings, Voyage, and mixedbread, as well as models ranked highly on newer leaderboards.
- Vector store choice is debated: S3 Vectors is praised for simplicity and cost but critiqued for higher latency and lack of sparse/keyword support; others stress picking stores that support metadata filtering and hybrid search.
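
As a toy illustration of why metadata filtering matters, a minimal in-memory sketch: restrict candidates by metadata first, then cosine-score only the survivors. A production vector store applies the filter server-side, but the access pattern is the same; the record layout and filter fields are illustrative.

```python
# Toy sketch of metadata-filtered vector search: restrict candidates by metadata
# first, then cosine-score only the survivors. A production vector store applies
# the filter server-side, but the access pattern is the same.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def filtered_search(query_vec: np.ndarray, records: list[dict],
                    metadata_filter: dict, k: int = 5) -> list[dict]:
    """records: [{"embedding": np.ndarray, "metadata": dict, "text": str}, ...]"""
    candidates = [
        r for r in records
        if all(r["metadata"].get(key) == value for key, value in metadata_filter.items())
    ]
    return sorted(candidates,
                  key=lambda r: cosine(query_vec, r["embedding"]),
                  reverse=True)[:k]

# e.g. filtered_search(q_vec, records, {"doc_type": "policy", "version": "2024"})
```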
UX, Evaluation, and Deployment Concerns
- Practitioners emphasize search-oriented UIs that make context and controls visible, rather than opaque chat, to keep user expectations aligned.
- Metadata “injection” (titles, authors, timestamps, versions) alongside chunks is seen as important for filtering and grounding (a small example appears at the end of this section).
- Some ask how these systems are evaluated (off-the-shelf frameworks vs. custom metrics) and whether retrieval quality translates into real process-efficiency gains.
- There’s debate over what “self-hosted” means when many “self-hosted” stacks still require multiple third-party cloud services.
- One notable operational finding: GPT‑5 reportedly underperformed GPT‑4.1 in this RAG setting with large contexts (worse instruction following, overly long answers, tighter context window), leading the author back to GPT‑4.1.
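
Returning to the metadata-injection point above, a minimal sketch of one common approach (field names are illustrative): prepend a compact header to each chunk before embedding, and keep the structured fields alongside the chunk for filtering.

```python
# Sketch of metadata "injection": prepend a compact header to the chunk text before
# embedding so the vector (and later the generator) sees title/author/date context,
# while the structured fields stay alongside for filtering. Field names are illustrative.
def inject_metadata(chunk: str, meta: dict) -> dict:
    header = (
        f"[title: {meta.get('title', '')}] "
        f"[author: {meta.get('author', '')}] "
        f"[date: {meta.get('date', '')}] "
        f"[version: {meta.get('version', '')}]"
    )
    return {
        "text_to_embed": header + "\n" + chunk,  # goes to the embedding model
        "raw_chunk": chunk,                      # shown to the generator / user
        "metadata": meta,                        # used by the store for filtering
    }

record = inject_metadata(
    "Refunds are processed within 14 days of a return request.",
    {"title": "Returns Policy", "author": "Ops", "date": "2024-06-01", "version": "v3"},
)
```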