Retrieval-augmented generation (RAG)
Definition
Retrieval-augmented generation (RAG) is an AI architecture that gives a large language model access to external documents at query time, retrieving relevant passages from a vector database or search index and inserting them into the model's context before it generates a response. RAG is the foundation of modern AI search and one of the most effective techniques for reducing hallucination.
How it works
A RAG pipeline runs four steps for every query (a minimal end-to-end sketch follows the list):
- Embedding the query. The user's question is converted into a numerical vector that captures its semantic meaning. The same embedding model that indexed the source documents is used here, so query and documents share the same vector space.
- Retrieving relevant chunks. The system searches an index (typically a vector database, often combined with keyword search for hybrid retrieval) for the document chunks most semantically similar to the query. Modern systems usually return 5–20 chunks and rerank them with a more precise model.
- Augmenting the prompt. The top chunks are inserted into the prompt as context, alongside the original question and a system instruction telling the model to ground its answer in the provided sources.
- Generating the response. The LLM produces an answer using the retrieved context. Most production systems return source citations so users can verify claims against the original documents.
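The flow is easier to see in code. Below is a minimal, self-contained Python sketch of the four steps, assuming a toy bag-of-words embedding and a hypothetical `call_llm` stub in place of a real embedding model and LLM API; only the pipeline shape, not the components, reflects production practice.

```python
import math
from collections import Counter

# Toy "embedding": a bag-of-words vector. A production system would use a
# trained embedding model; the four-step flow around it is identical.
def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical stand-in for any chat-completion API; not a real library call.
def call_llm(prompt: str) -> str:
    return f"(model response grounded in a {len(prompt)}-character prompt)"

# Index the corpus once, with the same embed() used later for queries.
documents = [
    "RAG retrieves passages from an index and inserts them into the prompt.",
    "Fine-tuning adjusts model weights through additional training.",
    "Reranking rescores candidate chunks with a cross-encoder.",
]
index = [(doc, embed(doc)) for doc in documents]

def rag_answer(question: str, k: int = 2) -> str:
    query_vec = embed(question)                                   # 1. embed the query
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    chunks = [doc for doc, _ in ranked[:k]]                       # 2. retrieve top-k chunks
    prompt = (                                                    # 3. augment the prompt
        "Answer using only the numbered sources below, and cite them.\n"
        + "\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
        + f"\n\nQuestion: {question}"
    )
    return call_llm(prompt)                                       # 4. generate the response

print(rag_answer("How does RAG differ from fine-tuning?"))
```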
Why it matters
RAG solves three fundamental LLM limitations.
Knowledge cutoffs. Models only know what was in their training data through a cutoff date. RAG provides current information at query time, so a model can answer questions about events, prices, or documentation that did not exist when the model was trained.
Private data access. Models cannot answer from data that was never in their training corpus — internal documentation, customer records, proprietary research. RAG lets models answer from any document set you put behind a retrieval index.
Hallucination reduction. When models have to ground their output in retrieved text rather than relying on parametric memory, they fabricate less. Industry analysis shows that properly implemented RAG can reduce hallucination by up to 71% on grounded benchmarks.
RAG is also what powers AI search itself. Perplexity, Google AI Overviews, ChatGPT search, and Claude with web search are all RAG systems — the difference is that the document index is the open web rather than a private corpus.
Key figures
- 2020: the year RAG was introduced by Lewis et al. at Facebook AI Research (Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks").
- Up to 71%: hallucination reduction from properly implemented RAG versus unaugmented generation (industry analysis, 2026).
- 100–300 words: standard chunk size for production RAG systems (industry consensus).
How to measure RAG quality
RAG systems fail in two distinct ways, and you need separate metrics for each.
Retrieval quality. Did the system find the right documents? Measured by recall (the percentage of relevant documents that were retrieved) and precision (the percentage of retrieved documents that were relevant). Poor retrieval guarantees poor generation — no model can answer well from irrelevant context.
Generation quality. Given the retrieved context, did the model use it correctly? Measured by faithfulness (the response stays grounded in the retrieved sources without contradiction or fabrication) and answer relevance (the response actually addresses the question). Vectara's HHEM-based Hallucination Leaderboard is among the most widely referenced benchmarks for grounded faithfulness.
Both metrics matter. A retrieval system with 90% recall feeding a generation step that hallucinates 30% of the time still yields hallucinated answers on 27% of queries (0.9 × 0.3), before even counting the 10% of queries where retrieval misses.
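The retrieval side of this split is straightforward to compute. Here is a minimal sketch, assuming hypothetical document IDs as ground truth; faithfulness and answer relevance on the generation side typically require an LLM or trained judge model and are not shown.

```python
def retrieval_metrics(retrieved: list[str], relevant: set[str]) -> dict[str, float]:
    """Recall and precision for a single query, given ground-truth document IDs.
    Average these across a labeled query set to score the retriever."""
    hits = sum(1 for doc_id in retrieved if doc_id in relevant)
    return {
        "recall": hits / len(relevant) if relevant else 0.0,       # share of relevant docs found
        "precision": hits / len(retrieved) if retrieved else 0.0,  # share of retrieved docs that are relevant
    }

# One query from a hypothetical labeled set: 2 of 3 relevant docs retrieved,
# 2 of 3 retrieved docs relevant -> recall 0.67, precision 0.67.
print(retrieval_metrics(retrieved=["d1", "d4", "d7"], relevant={"d1", "d2", "d4"}))
```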
How to improve RAG
Five techniques have measured impact:
- Hybrid retrieval. Combine dense vector search (semantic) with sparse keyword search (BM25 or similar). Each catches relevance signals the other misses (see the fusion sketch after this list).
- Reranking. After initial retrieval returns 20–50 candidate chunks, rerank them with a cross-encoder model that scores each query-chunk pair more precisely. Reranking significantly improves precision at the top of the result list.
- Chunking strategy. Source documents must be split into passages: too small and the model loses context; too large and irrelevant content dilutes the relevant signal. Chunks of 100–300 words with 10–20% overlap are the production-grade default for most domains (see the chunking sketch after this list).
- Query rewriting. The user's literal query is often a poor retrieval query. Modern systems rewrite or expand queries before retrieval, converting "how do I do X" into structured search terms that match how documentation is actually written.
- Evaluation pipelines. Build a ground-truth set of 50–500 query-answer pairs and run it on every change. RAG quality drifts silently as documents change and embeddings age. Continuous evaluation catches regressions before users do.
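Two of these techniques are small enough to sketch. First, hybrid retrieval: reciprocal rank fusion (RRF) is one common way to merge a keyword ranking and a vector ranking without tuning score weights. The document IDs here are hypothetical, and k=60 is the constant from the original RRF paper (Cormack et al., 2009).

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists into one by summing 1/(k + rank) per document."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["d3", "d1", "d7", "d2"]    # sparse keyword results (hypothetical IDs)
dense_ranking = ["d1", "d5", "d3", "d9"]   # dense vector results
candidates = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
# candidates[:20] would then go to a cross-encoder reranker for precise scoring.
```

Second, a word-level chunker implementing the 100–300-word, 10–20%-overlap default; real pipelines often split on sentence or section boundaries rather than raw word counts.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 30) -> list[str]:
    """Split text into ~size-word chunks; each chunk repeats the last
    `overlap` words of its predecessor so boundary sentences are never lost."""
    words = text.split()
    if not words:
        return []
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]
```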
Frequently asked questions
How is RAG different from fine-tuning?
Fine-tuning adjusts model weights through additional training on a target dataset. RAG keeps the model unchanged and provides knowledge through retrieved context at query time. Fine-tuning teaches the model new behavior; RAG gives the model access to new information. They are complementary — many production systems combine both.
Does RAG work with any LLM?
Yes. RAG is model-agnostic — any LLM that accepts a prompt with injected context can be the generator in a RAG pipeline. Practical considerations like context window size and instruction-following quality vary by model, but the architecture itself works across OpenAI, Anthropic, Google, and open-source models.
Is RAG the same as AI search?
AI search is one application of RAG. Perplexity, Google AI Overviews, ChatGPT search, and Claude with web search are all RAG systems where the document index is the open web. Enterprise RAG systems use the same architecture against private document sets — internal documentation, customer records, proprietary research.
What is the difference between RAG and grounding?
Grounding is the broader concept — connecting model output to verifiable sources. RAG is the specific implementation that grounds responses by retrieving documents at query time. All RAG is grounding; not all grounding is RAG. Tool use and function calling are other forms of grounding.
Can RAG eliminate hallucination?
No. Properly implemented RAG significantly reduces hallucination — up to 71% on grounded benchmarks — but does not eliminate it. Models can still misread retrieved context, ignore the grounding instructions, or hallucinate when the retrieval step fails to surface relevant content. RAG is a strong mitigation, not a fix.
Related terms
AI grounding
AI grounding is the practice of anchoring an LLM's response in retrieved, citable sources at inference time — instead of letting the model rely solely on its training memory. Grounding is what separates a hallucination-prone chatbot from a search-grade AI assistant like Perplexity, Google AI Overviews, Bing Chat, or retrieval-augmented ChatGPT.
AI hallucination
AI hallucination is when a large language model generates content that sounds plausible and confident but is factually wrong, fabricated, or unverifiable — invented citations, made-up statistics, or fictional events presented with the same fluency as accurate information. Hallucination is a structural feature of how LLMs work, not a bug that can be fully eliminated.