Technical Fundamentals

RAG

Retrieval-Augmented Generation — enhancing LLM outputs by retrieving relevant documents or data at inference time.

Retrieval-Augmented Generation (RAG) is a technique that enhances LLM outputs by dynamically retrieving relevant information from an external knowledge base at inference time, then including that information in the model's context before generating a response.

RAG addresses two fundamental limitations of LLMs: static training data (models don't know about events after their training cutoff) and hallucination (models may confidently generate false information when they lack relevant knowledge).

A RAG pipeline typically involves:

1. Indexing: Documents, databases, or other knowledge sources are split into chunks, converted into vector embeddings, and stored in a vector database.

2. Retrieval: When a query arrives, it is embedded and used to perform a similarity search against the vector database, retrieving the most relevant chunks.

3. Augmentation: Retrieved chunks are injected into the LLM's prompt as context, grounding the model's response in retrieved facts.

4. Generation: The LLM generates a response using both its parametric knowledge and the retrieved context.
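The four steps above can be sketched end-to-end with a toy in-memory index. This is an illustration only: the bag-of-words `embed` function stands in for a learned embedding model, the list of tuples stands in for a vector database, and the final prompt would be sent to an LLM rather than printed.

```python
import math
import re
from collections import Counter

# Toy embedding: a bag-of-words count vector. Real RAG systems use
# learned embedding models (e.g. sentence transformers) instead.
def embed(text):
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# 1. Indexing: chunk knowledge sources and store (embedding, chunk) pairs.
documents = [
    "The refund window is 30 days from the date of purchase.",
    "Premium subscribers receive priority email support.",
    "Shipping to international addresses takes 7-14 business days.",
]
index = [(embed(chunk), chunk) for chunk in documents]

# 2. Retrieval: embed the query and rank stored chunks by similarity.
def retrieve(query, k=2):
    q = embed(query)
    ranked = sorted(index, key=lambda e: cosine(q, e[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]

# 3. Augmentation: inject the retrieved chunks into the prompt as context.
def build_prompt(query):
    context = "\n".join(retrieve(query))
    return f"Answer using this context:\n{context}\n\nQuestion: {query}"

# 4. Generation: the augmented prompt would now be sent to an LLM.
prompt = build_prompt("What is the refund policy?")
print(prompt)
```

The same structure holds at production scale; only the components change: a real embedding model, an approximate-nearest-neighbor vector store, and an LLM call at step 4.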

In agentic systems, RAG is often implemented as a tool that agents can call to retrieve relevant knowledge before taking action. An agent handling a customer support request might first retrieve the customer's account history and relevant product documentation before formulating a response or resolution.
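A tool-based integration might look like the following sketch. The `Tool` dataclass, the lookup dictionaries, and the function names are all hypothetical, standing in for whatever tool-registration mechanism and data stores a given agent framework provides.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical tool wrapper; real agent frameworks define their own schemas.
@dataclass
class Tool:
    name: str
    description: str
    run: Callable[..., str]

# Illustrative in-memory stand-ins for real knowledge sources.
ACCOUNT_HISTORY = {"cust-42": "2 open tickets; subscription renewed in May."}
PRODUCT_DOCS = {"billing": "Invoices are issued on the 1st of each month."}

def retrieve_customer_context(customer_id: str, topic: str) -> str:
    """Combine account history and product docs for the agent's prompt."""
    history = ACCOUNT_HISTORY.get(customer_id, "no history on file")
    docs = PRODUCT_DOCS.get(topic, "no matching documentation")
    return f"Account history: {history}\nDocs: {docs}"

retrieval_tool = Tool(
    name="retrieve_customer_context",
    description="Fetch account history and product docs before responding.",
    run=retrieve_customer_context,
)

# The agent would invoke the tool before formulating a resolution:
context = retrieval_tool.run("cust-42", "billing")
print(context)
```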

Advanced RAG architectures — HyDE (Hypothetical Document Embeddings), RAPTOR (hierarchical retrieval over recursively summarized chunks), and GraphRAG (retrieval over an extracted knowledge graph) — address limitations of naive RAG by improving retrieval quality for complex queries, multi-hop reasoning tasks, and knowledge-intensive domains.
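To make one of these concrete, the core HyDE idea can be sketched in a few lines: rather than embedding the terse query directly, an LLM first drafts a hypothetical answer, whose embedding tends to land nearer the real answer documents. Everything below is a stand-in (the stub LLM, set-overlap "embeddings", and list search), not any library's actual API.

```python
# HyDE sketch: embed a hypothetical LLM-drafted answer, not the raw query.
def hyde_retrieve(query, llm, embed, search, k=2):
    hypothetical = llm(f"Write a short passage answering: {query}")
    return search(embed(hypothetical), k=k)

# Stub components standing in for a real LLM, embedder, and vector index.
def stub_llm(prompt):
    return "Refunds are accepted within 30 days of purchase."

def stub_embed(text):
    return set(text.lower().replace(".", "").split())

CORPUS = [
    "The refund window is 30 days from the date of purchase.",
    "Premium subscribers receive priority email support.",
]

def stub_search(query_vec, k):
    # Rank corpus documents by token overlap with the query "embedding".
    scored = sorted(
        CORPUS,
        key=lambda d: len(query_vec & stub_embed(d)),
        reverse=True,
    )
    return scored[:k]

results = hyde_retrieve("refund policy?", stub_llm, stub_embed, stub_search, k=1)
print(results[0])
```

Note how the hypothetical passage shares vocabulary with the answer document ("30 days", "purchase") that the bare query "refund policy?" does not, which is exactly the gap HyDE exploits.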