Retrieval-Augmented Generation (RAG)

Also known as: RAG, Retrieval Augmented Generation

Architectural pattern that combines information retrieval from a knowledge base with LLM generation for answers grounded in specific documents.

Updated: 2026-01-03

Definition

Retrieval-Augmented Generation (RAG) is an architectural pattern that combines an information retrieval system with a generative LLM. Instead of relying solely on the model’s parametric knowledge, RAG retrieves relevant documents from a knowledge base and passes them as context to the LLM for grounded answer generation.

The pattern mitigates two LLM limitations: knowledge cutoff (the model knows nothing more recent than its training data) and hallucinations (the model invents plausible but false facts).

How It Works

The basic flow of a RAG system:

  1. Indexing (offline): documents are segmented into chunks, converted to embeddings, and stored in a vector database

  2. Retrieval (runtime): the user query is converted to an embedding and used to retrieve the most similar chunks (typically top-k by cosine similarity)

  3. Augmentation: the retrieved chunks are inserted into the prompt as context

  4. Generation: the LLM generates a response based on the provided context (the full flow is sketched below)
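
A minimal end-to-end sketch of this flow in Python. The embed and llm functions are placeholders standing in for a real embedding model and chat LLM, and the brute-force cosine-similarity search stands in for a vector database:

```python
import numpy as np

def embed(texts):
    """Placeholder embedding: hashed bag-of-words vectors.
    A real system would call an embedding model here instead."""
    dim = 256
    vectors = np.zeros((len(texts), dim))
    for i, text in enumerate(texts):
        for token in text.lower().split():
            vectors[i, hash(token) % dim] += 1.0
    return vectors

def llm(prompt):
    """Placeholder generator; a real system would call a chat LLM here."""
    return f"[answer generated from a prompt of {len(prompt)} characters]"

def normalize(vectors):
    norms = np.linalg.norm(vectors, axis=-1, keepdims=True)
    return vectors / np.maximum(norms, 1e-12)

# 1. Indexing (offline): chunk, embed, store. A vector database is replaced
#    here by a plain normalized matrix held in memory.
chunks = [
    "RAG retrieves documents and passes them to the LLM as context.",
    "Chunking strategy strongly influences retrieval quality.",
    "Typical values of k are between 3 and 10.",
]
index = normalize(embed(chunks))

def answer(query, k=2):
    # 2. Retrieval (runtime): embed the query and rank chunks by cosine similarity.
    q = normalize(embed([query]))[0]
    top_k = np.argsort(index @ q)[::-1][:k]

    # 3. Augmentation: insert the retrieved chunks into the prompt as context.
    context = "\n\n".join(chunks[i] for i in top_k)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

    # 4. Generation: the LLM produces an answer grounded in that context.
    return llm(prompt)

print(answer("What does RAG pass to the LLM?"))
```

Because the chunk embeddings are normalized, the dot product in step 2 is exactly the cosine similarity used for ranking.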

Architectural Variants

Naive RAG: the basic flow described above. Simple to implement, but it struggles with complex queries.

Advanced RAG: adds pre-retrieval steps (query rewriting, HyDE) and post-retrieval steps (reranking, filtering) to improve quality.

Modular RAG: interchangeable components (retriever, reranker, generator) for task-specific optimization.

Agentic RAG: the LLM dynamically decides when and what to retrieve, with iterative retrieval-reasoning loops.
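
As an illustration of how the advanced and modular variants extend the naive flow, here is a sketch with pluggable pre- and post-retrieval stages; rewrite_query, retrieve, rerank, and generate are hypothetical stand-ins for whatever components are chosen:

```python
from typing import Callable, List

def advanced_rag(
    query: str,
    rewrite_query: Callable[[str], str],            # pre-retrieval: query rewriting or HyDE
    retrieve: Callable[[str, int], List[str]],      # retriever: returns candidate chunks
    rerank: Callable[[str, List[str]], List[str]],  # post-retrieval: reorders and filters
    generate: Callable[[str, List[str]], str],      # generator: LLM call with context
    k: int = 20,
    final_k: int = 5,
) -> str:
    """Modular pipeline: every stage is an interchangeable component."""
    rewritten = rewrite_query(query)             # pre-retrieval step
    candidates = retrieve(rewritten, k)          # wide retrieval, tuned for recall
    best = rerank(query, candidates)[:final_k]   # post-retrieval step, tuned for precision
    return generate(query, best)
```

Because every stage is just a callable, a retriever, reranker, or generator can be swapped without touching the rest of the pipeline, which is the essence of the modular variant.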

Evaluation Metrics

Retrieval quality:

  • Precision@k: the fraction of the top-k retrieved documents that are relevant
  • Recall@k: the fraction of all relevant documents that appear in the top-k
  • MRR (Mean Reciprocal Rank): the average, over queries, of the reciprocal rank of the first relevant result (all three are sketched below)
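
For concreteness, a minimal sketch of the three retrieval metrics, assuming relevance judgments are available as a set of relevant document IDs per query:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top = retrieved[:k]
    return sum(1 for doc in top if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents found in the top-k."""
    top = retrieved[:k]
    return sum(1 for doc in top if doc in relevant) / len(relevant)

def mrr(all_retrieved, all_relevant):
    """Mean Reciprocal Rank: average of 1/rank of the first relevant hit."""
    total = 0.0
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(all_retrieved)

# Example: relevant docs {"a", "c"}, system returned ["b", "a", "d"].
print(precision_at_k(["b", "a", "d"], {"a", "c"}, k=3))  # 1/3
print(recall_at_k(["b", "a", "d"], {"a", "c"}, k=3))     # 1/2
print(mrr([["b", "a", "d"]], [{"a", "c"}]))              # 1/2 (first hit at rank 2)
```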

Generation quality:

  • Faithfulness: is the response supported by the retrieved documents?
  • Answer relevance: does the response actually answer the question?
  • Context relevance: are the retrieved documents pertinent to the question?

Frameworks like RAGAS automate these evaluations.
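
As a sketch of what such a faithfulness check looks like under the hood, using an LLM-as-judge over individual claims; llm_judge is a hypothetical function, not the RAGAS API:

```python
def llm_judge(prompt: str) -> str:
    """Hypothetical LLM call used as a judge; returns 'yes' or 'no'."""
    raise NotImplementedError  # plug in any chat model

def faithfulness(answer: str, context: str) -> float:
    """Share of answer statements supported by the retrieved context."""
    # Naive split into candidate claims; real frameworks extract claims with an LLM.
    claims = [s.strip() for s in answer.split(".") if s.strip()]
    supported = 0
    for claim in claims:
        verdict = llm_judge(
            "Context:\n" + context +
            "\n\nClaim: " + claim +
            "\n\nIs the claim supported by the context? Answer yes or no."
        )
        supported += verdict.strip().lower().startswith("yes")
    return supported / len(claims) if claims else 1.0
```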

Practical Considerations

Chunking: document segmentation influences retrieval quality as much as the choice of embedding model. Small chunks lose context; large chunks dilute the signal. Common strategies: fixed-size with overlap, semantic chunking, and document-structure-aware splitting.
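
For illustration, a fixed-size chunker with overlap; sizes are in characters for simplicity, whereas production systems usually count tokens and respect sentence or section boundaries:

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks with overlap between neighbors."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

# Example: 2000 characters, 800-char chunks, 200-char overlap -> starts at 0, 600, 1200, 1800.
chunks = chunk_text("x" * 2000)
print(len(chunks), [len(c) for c in chunks])  # 4 chunks of 800, 800, 800, 200 chars
```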

Number of chunks (k): more chunks provide more context but increase cost, latency, and the risk of confusing the model. Typical values: 3-10, tuned empirically.

Costs: the main cost is the LLM call, since retrieved chunks are injected into the prompt. With large context windows and many chunks, token costs grow quickly.
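
A back-of-the-envelope sketch of how prompt size, and therefore cost, scales with k; the per-chunk token count and the price are illustrative placeholders, not real model pricing:

```python
def prompt_tokens(k: int, tokens_per_chunk: int = 500, question_tokens: int = 50) -> int:
    """Approximate prompt size: k retrieved chunks plus the user question."""
    return k * tokens_per_chunk + question_tokens

# Illustrative placeholder price; substitute your model's actual rate.
price_per_1k_input_tokens = 0.005  # currency units per 1000 tokens (assumed)

for k in (3, 5, 10, 20):
    tokens = prompt_tokens(k)
    cost = tokens / 1000 * price_per_1k_input_tokens
    print(f"k={k:2d} -> ~{tokens} prompt tokens, ~{cost:.4f} per query")
```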

When to Use RAG vs Alternatives

RAG is appropriate when:

  • Knowledge base changes frequently
  • Need to cite specific sources
  • Corpus too large for fine-tuning
  • Parametric and document knowledge need to be combined

Alternatives to consider:

  • Fine-tuning: for static knowledge and well-defined tasks
  • Long-context models: for small corpora that fit in the context window
  • Hybrid: fine-tuning + RAG to combine the benefits of both

Common Misconceptions

“RAG eliminates hallucinations”

It reduces them; it does not eliminate them. The model can still ignore the context, misinterpret it, or synthesize information that is not present in it. Guardrails and evaluation remain necessary.

“A vector database is enough”

Retrieval is only one part of the system. Chunking strategy, embedding model, reranking, and prompt engineering are all critical factors. A poorly configured RAG system can underperform a vanilla LLM.

“More documents = better answers”

No. Too much context can confuse the model (the “lost in the middle” problem) and increase costs. Retrieval quality beats quantity.
