RAG (Retrieval-Augmented Generation)

Retrieval-Augmented Generation (RAG) is a technique that retrieves relevant external information and feeds it into an LLM's context before the model generates a response.

Also known as: RAG, retrieval-augmented generation

What is RAG?

Retrieval-Augmented Generation (RAG) is a way to ground a language model's answers in information it was never trained on. Rather than relying solely on what the model memorized, a RAG system first searches an external knowledge base for content relevant to the question, then places that content into the prompt so the model can reason over it directly.

The technique addresses two structural limits of any LLM: a training cutoff (the model knows nothing about events after it) and a closed corpus (the model never saw your private documents). RAG sidesteps both by injecting the missing knowledge at query time.

How a RAG pipeline works

A typical pipeline has an indexing phase and a query phase. During indexing, documents are split into chunks, each chunk is converted into an embedding — a numeric vector capturing its meaning — and the vectors are stored in a vector database. At query time, the user's question is embedded the same way, the database returns the chunks whose vectors sit closest to the question's, and those chunks are concatenated into the model's context window alongside the question.
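The two phases can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production design: the `embed` function is a toy word-count vector standing in for a real embedding model, `InMemoryIndex` is a hypothetical stand-in for a vector database, and the sample documents are invented for the example.

```python
import math
import re
from collections import Counter


def chunk(text, max_words=40):
    """Split text into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Toy embedding (assumption): sparse word counts instead of a learned vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryIndex:
    """Hypothetical stand-in for a vector database."""

    def __init__(self):
        self.entries = []  # list of (chunk_text, vector) pairs

    def add_document(self, text, max_words=40):
        # Indexing phase: chunk, embed, store.
        for piece in chunk(text, max_words):
            self.entries.append((piece, embed(piece)))

    def search(self, question, k=2):
        # Query phase: embed the question the same way, return the closest chunks.
        q = embed(question)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


def build_prompt(question, chunks):
    """Augment: place the retrieved chunks in the prompt alongside the question."""
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


index = InMemoryIndex()
index.add_document("The refund window is 30 days from delivery.")
index.add_document("Support is available on weekdays from 9am to 5pm.")

question = "How long is the refund window?"
top = index.search(question, k=1)
prompt = build_prompt(question, top)  # this string would be sent to the LLM
```

The numbered chunk labels in the prompt are one simple way to let the model cite which retrieved passage it relied on; a real system would swap `embed` for a learned model and `InMemoryIndex` for an actual vector store.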

Production systems layer refinements on top: hybrid search that mixes keyword and vector matching, rerankers that re-score the retrieved candidates, and chunking strategies tuned to the document type. But the skeleton — embed, store, retrieve, augment, generate — stays the same.
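As an illustration of the hybrid-search refinement, the sketch below merges a keyword ranking and a vector ranking using reciprocal rank fusion. The text above does not say how the two result lists are combined, so this method is an assumption, chosen because it is widely used. The document IDs and the constant `k=60` (a conventional default) are illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one.

    Each document scores 1 / (k + rank) for every list it appears in,
    so items ranked highly by multiple retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for deterministic output.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from two retrievers for the same query.
keyword_hits = ["doc_b", "doc_a", "doc_d"]
vector_hits = ["doc_a", "doc_c", "doc_b"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

Here `doc_a` and `doc_b` appear in both lists and therefore outrank the documents found by only one retriever. A reranker, if used, would then re-score the top of `fused` with a more expensive model.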

Why RAG matters

Grounding generation in retrieved text tends to reduce hallucination, because the model can quote and cite the retrieved passages rather than reconstruct facts from memory. It also keeps knowledge current without retraining: update the index and the system's answers update with it. And it makes provenance possible: a RAG answer can point at the exact chunks it drew from, which matters anywhere answers must be auditable, from legal research to internal support bots.

RAG and MCP

In agent systems, retrieval increasingly arrives as a tool call rather than a built-in pipeline stage. Many MCP servers exist specifically to provide RAG-style retrieval — over library documentation, codebases, scraped web content, or proprietary datasets — so any MCP client can bolt grounded search onto its model without building an indexing pipeline of its own.

This reframes RAG as a service boundary: the hard parts (corpus maintenance, embedding, index freshness) live behind the server, and the agent just asks questions. The same server then works identically across Claude Desktop, Cursor, or any other MCP client.
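From the agent's side, "just asking questions" amounts to a single tool call. The sketch below builds the JSON-RPC 2.0 message that MCP uses for tool invocation (`tools/call`) and reads text items out of a result. The tool name `search_docs`, its `query` and `top_k` arguments, and the sample response are hypothetical; a real server advertises its own tool names and argument schema.

```python
import json


def make_tool_call(request_id, tool_name, arguments):
    """Build an MCP tools/call request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


def extract_text(response):
    """Collect the text items from a tool result's content list."""
    content = response.get("result", {}).get("content", [])
    return [item["text"] for item in content if item.get("type") == "text"]


# Hypothetical retrieval tool and arguments.
request = make_tool_call(1, "search_docs", {"query": "refund window", "top_k": 3})
wire = json.dumps(request)  # what the client would send over the transport

# Hypothetical response, shaped like an MCP tool result.
sample_response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "content": [
            {"type": "text", "text": "The refund window is 30 days from delivery."}
        ]
    },
}
retrieved_chunks = extract_text(sample_response)
```

The returned chunks then go into the model's context exactly as they would in a built-in pipeline; only the location of the indexing and search work has changed.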

Monetizing retrieval

Retrieval is one of the cleanest fits for per-call pricing, because each query has an obvious unit of value and a real marginal cost (embedding, vector search, sometimes reranking). A retrieval-focused MCP server listed on Loomal can attach an x402 price per query — minimum $0.01 — and agents pay in USDC, settled on Base in about two seconds, before the search runs. Curated, well-maintained corpora are exactly the kind of asset agents will pay to query rather than rebuild.

Related terms: Embedding, Vector Database, Large Language Model, Context Window, MCP Server