Retrieval-Augmented Generation (RAG)

Also known as: RAG, Retrieval Augmented Generation

Architectural pattern that combines information retrieval from a knowledge base with LLM generation for answers grounded in specific documents.

Updated: 2026-01-03

Definition

Retrieval-Augmented Generation (RAG) is an architectural pattern that combines an information retrieval system with a generative LLM. Instead of relying solely on the model’s parametric knowledge, RAG retrieves relevant documents from a knowledge base and passes them as context to the LLM for grounded answer generation.

The pattern mitigates two LLM limitations: knowledge cutoff (the model knows nothing more recent than its training data) and hallucinations (the model invents plausible but false facts).

How It Works

The basic flow of a RAG system:

  1. Indexing (offline): documents are segmented into chunks, converted to embeddings, and stored in a vector database

  2. Retrieval (runtime): the user query is converted to an embedding and used to retrieve the most similar chunks (typically top-k by cosine similarity)

  3. Augmentation: the retrieved chunks are inserted into the prompt as context

  4. Generation: the LLM generates a response based on the provided context (the full flow is sketched below)
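
A minimal end-to-end sketch of this flow in Python. The embed and llm functions are placeholders standing in for a real embedding model and chat LLM, and the brute-force cosine-similarity search stands in for a vector database:

```python
import numpy as np

def embed(texts):
    """Placeholder embedding: hashed bag-of-words vectors.
    A real system would call an embedding model here instead."""
    dim = 256
    vectors = np.zeros((len(texts), dim))
    for i, text in enumerate(texts):
        for token in text.lower().split():
            vectors[i, hash(token) % dim] += 1.0
    return vectors

def llm(prompt):
    """Placeholder generator; a real system would call a chat LLM here."""
    return f"[answer generated from a prompt of {len(prompt)} characters]"

def normalize(vectors):
    norms = np.linalg.norm(vectors, axis=-1, keepdims=True)
    return vectors / np.maximum(norms, 1e-12)

# 1. Indexing (offline): chunk, embed, store. A vector database is replaced
#    here by a plain normalized matrix held in memory.
chunks = [
    "RAG retrieves documents and passes them to the LLM as context.",
    "Chunking strategy strongly influences retrieval quality.",
    "Typical values of k are between 3 and 10.",
]
index = normalize(embed(chunks))

def answer(query, k=2):
    # 2. Retrieval (runtime): embed the query and rank chunks by cosine similarity.
    q = normalize(embed([query]))[0]
    top_k = np.argsort(index @ q)[::-1][:k]

    # 3. Augmentation: insert the retrieved chunks into the prompt as context.
    context = "\n\n".join(chunks[i] for i in top_k)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

    # 4. Generation: the LLM produces an answer grounded in that context.
    return llm(prompt)

print(answer("What does RAG pass to the LLM?"))
```

Because the chunk embeddings are normalized, the dot product in step 2 is exactly the cosine similarity used for ranking.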

Architectural Variants

Naive RAG: the basic flow described above. Simple to implement, but it struggles with complex queries.

Advanced RAG: adds pre-retrieval steps (query rewriting, HyDE) and post-retrieval steps (reranking, filtering) to improve quality.

Modular RAG: interchangeable components (retriever, reranker, generator) for task-specific optimization.

Agentic RAG: the LLM dynamically decides when and what to retrieve, with iterative retrieval-reasoning loops.
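
As an illustration of how the advanced and modular variants extend the naive flow, here is a sketch with pluggable pre- and post-retrieval stages; rewrite_query, retrieve, rerank, and generate are hypothetical stand-ins for whatever components are chosen:

```python
from typing import Callable, List

def advanced_rag(
    query: str,
    rewrite_query: Callable[[str], str],            # pre-retrieval: query rewriting or HyDE
    retrieve: Callable[[str, int], List[str]],      # retriever: returns candidate chunks
    rerank: Callable[[str, List[str]], List[str]],  # post-retrieval: reorders and filters
    generate: Callable[[str, List[str]], str],      # generator: LLM call with context
    k: int = 20,
    final_k: int = 5,
) -> str:
    """Modular pipeline: every stage is an interchangeable component."""
    rewritten = rewrite_query(query)             # pre-retrieval step
    candidates = retrieve(rewritten, k)          # wide retrieval, tuned for recall
    best = rerank(query, candidates)[:final_k]   # post-retrieval step, tuned for precision
    return generate(query, best)
```

Because every stage is just a callable, a retriever, reranker, or generator can be swapped without touching the rest of the pipeline, which is the essence of the modular variant.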

Evaluation Metrics

Retrieval quality:

  • Precision@k: the fraction of the top-k retrieved documents that are relevant
  • Recall@k: the fraction of all relevant documents that appear in the top-k
  • MRR (Mean Reciprocal Rank): the average, over queries, of the reciprocal rank of the first relevant result (all three are sketched below)
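
For concreteness, a minimal sketch of the three retrieval metrics, assuming relevance judgments are available as a set of relevant document IDs per query:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top = retrieved[:k]
    return sum(1 for doc in top if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents found in the top-k."""
    top = retrieved[:k]
    return sum(1 for doc in top if doc in relevant) / len(relevant)

def mrr(all_retrieved, all_relevant):
    """Mean Reciprocal Rank: average of 1/rank of the first relevant hit."""
    total = 0.0
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(all_retrieved)

# Example: relevant docs {"a", "c"}, system returned ["b", "a", "d"].
print(precision_at_k(["b", "a", "d"], {"a", "c"}, k=3))  # 1/3
print(recall_at_k(["b", "a", "d"], {"a", "c"}, k=3))     # 1/2
print(mrr([["b", "a", "d"]], [{"a", "c"}]))              # 1/2 (first hit at rank 2)
```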

Generation quality:

  • Faithfulness: is the response supported by the retrieved documents?
  • Answer relevance: does the response actually answer the question?
  • Context relevance: are the retrieved documents pertinent to the question?

Frameworks like RAGAS automate these evaluations.
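
As a sketch of what such a faithfulness check looks like under the hood, using an LLM-as-judge over individual claims; llm_judge is a hypothetical function, not the RAGAS API:

```python
def llm_judge(prompt: str) -> str:
    """Hypothetical LLM call used as a judge; returns 'yes' or 'no'."""
    raise NotImplementedError  # plug in any chat model

def faithfulness(answer: str, context: str) -> float:
    """Share of answer statements supported by the retrieved context."""
    # Naive split into candidate claims; real frameworks extract claims with an LLM.
    claims = [s.strip() for s in answer.split(".") if s.strip()]
    supported = 0
    for claim in claims:
        verdict = llm_judge(
            "Context:\n" + context +
            "\n\nClaim: " + claim +
            "\n\nIs the claim supported by the context? Answer yes or no."
        )
        supported += verdict.strip().lower().startswith("yes")
    return supported / len(claims) if claims else 1.0
```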

Practical Considerations

Chunking: document segmentation influences retrieval quality as much as the choice of embedding model. Small chunks lose context; large chunks dilute the signal. Common strategies: fixed-size with overlap, semantic chunking, and document-structure-aware splitting.
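
For illustration, a fixed-size chunker with overlap; sizes are in characters for simplicity, whereas production systems usually count tokens and respect sentence or section boundaries:

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks with overlap between neighbors."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

# Example: 2000 characters, 800-char chunks, 200-char overlap -> starts at 0, 600, 1200, 1800.
chunks = chunk_text("x" * 2000)
print(len(chunks), [len(c) for c in chunks])  # 4 chunks of 800, 800, 800, 200 chars
```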

Number of chunks (k): more chunks provide more context but increase cost, latency, and the risk of confusing the model. Typical values: 3-10, tuned empirically.

Costs: the main cost is the LLM call, since retrieved chunks are injected into the prompt. With large context windows and many chunks, token costs grow quickly.
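
A back-of-the-envelope sketch of how prompt size, and therefore cost, scales with k; the per-chunk token count and the price are illustrative placeholders, not real model pricing:

```python
def prompt_tokens(k: int, tokens_per_chunk: int = 500, question_tokens: int = 50) -> int:
    """Approximate prompt size: k retrieved chunks plus the user question."""
    return k * tokens_per_chunk + question_tokens

# Illustrative placeholder price; substitute your model's actual rate.
price_per_1k_input_tokens = 0.005  # currency units per 1000 tokens (assumed)

for k in (3, 5, 10, 20):
    tokens = prompt_tokens(k)
    cost = tokens / 1000 * price_per_1k_input_tokens
    print(f"k={k:2d} -> ~{tokens} prompt tokens, ~{cost:.4f} per query")
```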

When to Use RAG vs Alternatives

RAG is appropriate when:

  • Knowledge base changes frequently
  • Need to cite specific sources
  • Corpus too large for fine-tuning
  • Parametric and document knowledge need to be combined

Alternatives to consider:

  • Fine-tuning: for static knowledge and well-defined tasks
  • Long-context models: for small corpora that fit in the context window
  • Hybrid: fine-tuning + RAG to combine the benefits of both

Common Misconceptions

“RAG eliminates hallucinations”

It reduces them; it does not eliminate them. The model can still ignore the context, misinterpret it, or synthesize information that is not present in it. Guardrails and evaluation remain necessary.

“A vector database is enough”

Retrieval is only one part of the system. Chunking strategy, embedding model, reranking, and prompt engineering are all critical factors. A poorly configured RAG system can underperform a vanilla LLM.

“More documents = better answers”

No. Too much context can confuse the model (the “lost in the middle” problem) and increase costs. Retrieval quality beats quantity.
