The RAG Obituary: Killed by agents, buried by context windows
Perception of the article and AI-generated prose
- Several commenters feel the piece reads like LLM-written “slop”: overly chipper tone, repetitive structure, and thin technical depth.
- Some see it as a stealth ad (an early company mention, promotional framing); others argue it’s unusually self-critical for marketing and contains useful lessons.
- There’s pushback against derailing threads into “was this AI-written?” debates; moderators ask people to flag or report quietly rather than make in-thread accusations.
What “RAG” actually means
- A major thread is definitional: some use “RAG” narrowly as “embeddings + vector DB + chunking + top‑K + reranking.”
- Others insist RAG is a general pattern: any retrieval (BM25, SQL, grep, APIs, tools) that augments LLM generation.
- This leads to disagreement over claims like “grep isn’t RAG” vs. “grep + LLM is just primitive RAG”; the two readings are contrasted in the sketch below.
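To make the two readings concrete, here is a minimal Python sketch contrasting them. `embed()`, `rerank()`, and `llm()` are hypothetical stand-ins for whatever model or provider is in use, not real APIs, and the chunking is deliberately naive.

```python
import math

def chunk(doc: str, size: int = 800) -> list[str]:
    """Naive fixed-size chunking; real pipelines split on document structure."""
    return [doc[i:i + size] for i in range(0, len(doc), size)]

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

def narrow_rag(question: str, corpus: list[str], embed, rerank, llm, k: int = 5) -> str:
    """Narrow reading: chunk -> embed -> top-K vector search -> rerank -> generate."""
    chunks = [c for doc in corpus for c in chunk(doc)]
    q_vec = embed(question)
    top_k = sorted(chunks, key=lambda c: cosine(q_vec, embed(c)), reverse=True)[:k]
    context = "\n\n".join(rerank(question, top_k))
    return llm(f"Context:\n{context}\n\nQuestion: {question}")

def broad_rag(question: str, retrievers: list, llm) -> str:
    """Broad reading: any retrieval (BM25, SQL, grep, an API call) that augments
    generation counts -- on this view, 'grep + LLM' is just primitive RAG."""
    evidence = [hit for retrieve in retrievers for hit in retrieve(question)]
    return llm("Evidence:\n" + "\n".join(evidence) + f"\n\nQuestion: {question}")
```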
Grep/agentic search vs. traditional RAG
Pro‑agentic/grep side:
- Larger context windows let models read full files/docs; a simple grep/ripgrep loop plus iterative querying often beats complex pipelines for code and some document sets.
- Agents can chain multiple searches, refine terms, follow references, and write notes/markdown “memory” files, approximating how humans work (a minimal loop is sketched after this list).
- Traditional RAG pipelines (chunking, embeddings, vector DBs, rerankers) are fiddly, brittle, and expensive to build and maintain, especially once document permissions must be enforced at retrieval time.
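A rough sketch of the grep-in-a-loop approach these commenters describe, assuming ripgrep (`rg`) is installed and with `llm()` again a hypothetical stand-in; the in-memory notes list plays the role of the markdown “memory” file.

```python
import subprocess

def ripgrep(pattern: str, path: str, max_per_file: int = 20) -> list[str]:
    """Search with ripgrep; each hit comes back as 'file:line:text'."""
    out = subprocess.run(
        ["rg", "--no-heading", "-n", "-i", "-m", str(max_per_file), pattern, path],
        capture_output=True, text=True,
    )
    return out.stdout.splitlines()

def grep_agent(question: str, repo: str, llm, max_rounds: int = 5) -> str:
    """Chain searches: propose a pattern, read the hits, refine the terms,
    and keep running notes, roughly how a human would grep a codebase."""
    notes: list[str] = []
    pattern = llm(f"Give one ripgrep pattern for: {question}")
    for _ in range(max_rounds):
        hits = ripgrep(pattern, repo)
        notes.append(f"## rg '{pattern}'\n" + "\n".join(hits[:40]))
        step = llm(
            f"Question: {question}\nNotes so far:\n" + "\n".join(notes)
            + "\nReply 'ANSWER: <answer>' or 'SEARCH: <new pattern>'."
        )
        if step.startswith("ANSWER:"):
            return step.removeprefix("ANSWER:").strip()
        pattern = step.removeprefix("SEARCH:").strip()
    return llm("Best effort from these notes:\n" + "\n".join(notes))
```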
Skeptical side:
- Grep fails on synonyms, paraphrases, and vocabulary mismatch, which is exactly where embeddings shine.
- Codebases are a best-case corpus for grep; unstructured enterprise text, regulations, and huge tenders need semantic retrieval and ranking.
- “Agentic search” typically includes RAG components (hybrid search, embeddings, rerankers); it’s more “RAG inside a loop” than a replacement (see the hybrid-retrieval sketch below).
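A sketch of what “RAG inside a loop” tends to look like in practice: BM25 for lexical matches, embeddings for semantic ones, reciprocal rank fusion to merge the two rankings, and a reranker at the end. `embed()` and `rerank()` are still hypothetical, and `cosine()` is the helper from the first sketch.

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Lexical BM25 scoring: strong on exact terms, weak on synonyms and paraphrase."""
    toks = [d.lower().split() for d in docs]
    avgdl = sum(len(t) for t in toks) / len(toks)
    scores = [0.0] * len(docs)
    for term in query.lower().split():
        df = sum(term in t for t in toks)
        if df == 0:
            continue
        idf = math.log(1 + (len(docs) - df + 0.5) / (df + 0.5))
        for i, t in enumerate(toks):
            tf = t.count(term)
            scores[i] += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(t) / avgdl))
    return scores

def hybrid_retrieve(query: str, docs: list[str], embed, rerank, k: int = 10) -> list[str]:
    """Fuse lexical and semantic rankings with reciprocal rank fusion,
    then let a reranker pick what actually enters the context window."""
    lex = bm25_scores(query, docs)
    q_vec = embed(query)
    sem = [cosine(q_vec, embed(d)) for d in docs]      # cosine() from the sketch above
    fused: Counter = Counter()
    for scores in (lex, sem):
        ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
        for rank, i in enumerate(ranking):
            fused[i] += 1.0 / (60 + rank)              # commonly used RRF constant
    return rerank(query, [docs[i] for i, _ in fused.most_common(k)])
```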
Scale, cost, and context windows
- Commenters stress scaling limits: millions of documents, trillion-token corpora, or billion-token tenders can’t just be “thrown into context,” even with 1–10M token windows (see the back-of-envelope numbers after this list).
- Context rot, latency, and cost remain hard constraints; embeddings and rerankers are still valuable for narrowing millions of candidates down to dozens.
- Some argue LLM costs are trending down; others note that energy costs, capex, and a lack of profitability make “near zero” unlikely for cutting-edge models.
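Some rough, assumed numbers (not any provider’s actual pricing) illustrate why corpus-scale contexts don’t pencil out:

```python
# Back-of-envelope only; the corpus size, window size, and price are assumptions.
CORPUS_TOKENS = 1_000_000_000      # a "billion-token tender" from the thread
WINDOW_TOKENS = 1_000_000          # an optimistic 1M-token context window
PRICE_PER_M_INPUT = 3.00           # assumed dollars per 1M input tokens

windows_needed = CORPUS_TOKENS // WINDOW_TOKENS
cost_per_full_read = CORPUS_TOKENS / 1_000_000 * PRICE_PER_M_INPUT

print(f"{windows_needed} maxed-out context windows per question")
print(f"~${cost_per_full_read:,.0f} in input tokens per full corpus read")
# -> 1000 windows and ~$3,000 per question if you brute-force the context,
#    which is why narrowing millions of chunks to dozens still matters.
```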
Consensus-ish views
- Top‑K/vector‑only RAG is increasingly inadequate on its own.
- Future systems will blend agentic workflows, multiple retrieval tools (including grep), hybrid/graph RAG, and smarter orchestration; retrieval isn’t dead, but its role is changing (a sketch of such a blend follows).
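A sketch of what that blend could look like: an agent loop that routes each step to one of several retrieval tools, reusing the hypothetical pieces from the earlier sketches (error handling omitted).

```python
def blended_agent(question: str, tools: dict, llm, max_steps: int = 6) -> str:
    """Retrieval as a toolbox: the model picks a tool per step (grep, hybrid
    search, SQL, ...), reads the results, and iterates until it can answer."""
    findings: list[str] = []
    for _ in range(max_steps):
        decision = llm(
            f"Question: {question}\nTools: {', '.join(tools)}\n"
            "Findings so far:\n" + "\n".join(findings)
            + "\nReply 'USE <tool> <query>' or 'ANSWER <text>'."
        )
        if decision.startswith("ANSWER"):
            return decision[len("ANSWER"):].strip()
        _, tool, query = decision.split(" ", 2)
        results = tools.get(tool, lambda q: ["unknown tool"])(query)
        findings.append(f"[{tool}] {query}\n" + "\n".join(results))
    return llm("Answer from these findings:\n" + "\n".join(findings))

# Example wiring with the (hypothetical) retrievers sketched above:
# answer = blended_agent(
#     "Which clause caps liability?",
#     tools={"grep": lambda q: ripgrep(q, "./tenders"),
#            "hybrid": lambda q: hybrid_retrieve(q, docs, embed, rerank)},
#     llm=llm,
# )
```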