What is RAG?
Retrieval-Augmented Generation (RAG) is an architecture pattern that enhances LLM responses by retrieving relevant information from external knowledge sources before generating an answer. Instead of relying solely on the model's training data (which has a knowledge cutoff), RAG dynamically fetches up-to-date, domain-specific information from your own data sources.
RAG was introduced by Lewis et al. in 2020 and has become the standard approach for building knowledge-grounded AI applications. It addresses two fundamental LLM limitations: hallucination (confidently stating invented facts) and knowledge staleness (training data going out of date).
Why Use RAG?
- Reduce Hallucinations: Ground responses in actual documents and data
- Fresh Knowledge: Access up-to-date information beyond the training cutoff
- Domain Specificity: Answer questions about your proprietary data, docs, and code
- Source Attribution: Cite the exact documents that support each answer
- Cost Effective: Cheaper than fine-tuning for most knowledge-grounding use cases
- Data Privacy: Keep sensitive data in your own infrastructure, not in model weights
RAG Architecture
The RAG pipeline consists of two main phases: Indexing (offline) and Retrieval + Generation (online).
RAG Pipeline Steps
| Phase | Step | Description |
|---|---|---|
| Indexing | 1. Load | Ingest documents from files, APIs, databases, or web |
| Indexing | 2. Chunk | Split documents into smaller, meaningful pieces |
| Indexing | 3. Embed | Convert chunks into vector embeddings |
| Indexing | 4. Store | Save embeddings in a vector database |
| Query | 5. Embed Query | Convert user question into a vector |
| Query | 6. Retrieve | Find most similar document chunks (see the similarity sketch after this table) |
| Query | 7. Generate | Pass retrieved context + question to LLM |
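Step 6's "most similar" is typically cosine similarity between the query vector and each stored chunk vector. Here is a minimal sketch of that idea in plain Python; it is illustrative only, since in practice the vector database performs this ranking with approximate nearest-neighbor indexes rather than a linear scan:

# Illustrative sketch: what "most similar" means in step 6
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec: list[float], chunks: list[tuple[str, list[float]]], k: int = 5):
    # chunks: (text, embedding) pairs; score each against the query and keep the best k
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in chunks]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]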
Building a Basic RAG Pipeline
Let's build a complete RAG system from scratch to understand every component.
// Complete RAG pipeline in TypeScript
import { OpenAIEmbeddings } from "@langchain/openai";
import { ChatAnthropic } from "@langchain/anthropic";
import { ChromaClient } from "chromadb";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import * as fs from "fs";
// Step 1: Load documents
function loadDocuments(directory: string): { content: string; source: string }[] {
  const files = fs.readdirSync(directory);
  return files
    .filter(f => f.endsWith(".txt") || f.endsWith(".md"))
    .map(f => ({
      content: fs.readFileSync(`${directory}/${f}`, "utf-8"),
      source: f,
    }));
}
// Step 2: Chunk documents
async function chunkDocuments(docs: { content: string; source: string }[]) {
  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 200, // overlap preserves context that straddles chunk boundaries
    separators: ["\n\n", "\n", ". ", " ", ""],
  });
  const chunks: { content: string; source: string; index: number }[] = [];
  for (const doc of docs) {
    const splits = await splitter.splitText(doc.content);
    splits.forEach((text, i) => {
      chunks.push({ content: text, source: doc.source, index: i });
    });
  }
  return chunks;
}
// Step 3 & 4: Embed and store
async function indexDocuments(chunks: { content: string; source: string; index: number }[]) {
  const embeddings = new OpenAIEmbeddings({ model: "text-embedding-3-small" });
  const chroma = new ChromaClient(); // assumes a Chroma server running locally (default port 8000)
  const collection = await chroma.getOrCreateCollection({ name: "my_docs" });
  // Batch embed and store to stay within embedding-API request limits
  const batchSize = 100;
  for (let i = 0; i < chunks.length; i += batchSize) {
    const batch = chunks.slice(i, i + batchSize);
    const vectors = await embeddings.embedDocuments(batch.map(c => c.content));
    await collection.add({
      ids: batch.map((_, j) => `chunk_${i + j}`),
      embeddings: vectors,
      documents: batch.map(c => c.content),
      metadatas: batch.map(c => ({ source: c.source, index: c.index })),
    });
  }
  return collection;
}
// Steps 5-7: Query pipeline
async function queryRAG(question: string): Promise<string> {
  const embeddings = new OpenAIEmbeddings({ model: "text-embedding-3-small" });
  const chroma = new ChromaClient();
  const collection = await chroma.getCollection({ name: "my_docs" });
  const llm = new ChatAnthropic({ model: "claude-sonnet-4-20250514" });
  // Step 5: Embed the question with the same model used at indexing time
  const queryVector = await embeddings.embedQuery(question);
  // Step 6: Retrieve relevant chunks
  const results = await collection.query({
    queryEmbeddings: [queryVector],
    nResults: 5,
  });
  const context = results.documents?.[0]?.join("\n\n") || "";
  const sources = results.metadatas?.[0]?.map((m: any) => m.source) || [];
  // Step 7: Generate answer with context
  const response = await llm.invoke([
    {
      role: "system",
      content: `You are a helpful assistant. Answer questions based ONLY on the provided context.
If the context doesn't contain the answer, say "I don't have enough information to answer that."
Always cite your sources.
Context:
${context}`,
    },
    { role: "user", content: question },
  ]);
  return `${response.content}\n\nSources: ${[...new Set(sources)].join(", ")}`;
}
// Run the full pipeline (top-level await requires an ES module context)
const docs = loadDocuments("./knowledge-base");
const chunks = await chunkDocuments(docs);
await indexDocuments(chunks);
const answer = await queryRAG("What is our company's vacation policy?");
console.log(answer);
The same pipeline in Python, where LangChain's document loaders and vector-store helpers replace the hand-rolled functions:
# Complete RAG pipeline in Python
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_anthropic import ChatAnthropic
from langchain_chroma import Chroma
# Step 1: Load documents
# pathlib-style globs don't support brace expansion like *.{txt,md},
# so load each extension separately and combine the results
txt_loader = DirectoryLoader("./knowledge-base", glob="**/*.txt", loader_cls=TextLoader)
md_loader = DirectoryLoader("./knowledge-base", glob="**/*.md", loader_cls=TextLoader)
documents = txt_loader.load() + md_loader.load()
print(f"Loaded {len(documents)} documents")
# Step 2: Chunk documents
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks")
# Steps 3 & 4: Embed and store in vector database
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",
    collection_name="my_docs",
)
# Steps 5-7: Query pipeline
def query_rag(question: str) -> str:
    # Steps 5 & 6: the retriever embeds the question and finds the most similar chunks
    retriever = vectorstore.as_retriever(
        search_type="similarity",
        search_kwargs={"k": 5},
    )
    relevant_docs = retriever.invoke(question)
    # Build context from retrieved documents
    context = "\n\n".join(doc.page_content for doc in relevant_docs)
    sources = set(doc.metadata.get("source", "unknown") for doc in relevant_docs)
    # Step 7: Generate answer
    llm = ChatAnthropic(model="claude-sonnet-4-20250514")
    response = llm.invoke([
        {"role": "system", "content": f"""Answer based ONLY on the context below.
If the context doesn't contain the answer, say so. Cite your sources.
Context:
{context}"""},
        {"role": "user", "content": question},
    ])
    return f"{response.content}\n\nSources: {', '.join(sorted(sources))}"
# Usage
answer = query_rag("What is our company's vacation policy?")
print(answer)
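Because persist_directory was set during indexing, the Chroma index survives process restarts. On a later run you can reconnect to the existing collection instead of re-embedding everything; a sketch assuming the same langchain_chroma API used above:

# Reconnect to the persisted index on a later run; no re-embedding needed
vectorstore = Chroma(
    collection_name="my_docs",
    embedding_function=embeddings,
    persist_directory="./chroma_db",
)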
RAG vs. Fine-Tuning vs. Prompt Engineering
| Approach | Best For | Cost | Data Freshness |
|---|---|---|---|
| Prompt Engineering | Small context, simple tasks (see the baseline sketch below) | Low | Manual updates |
| RAG | Large knowledge bases, dynamic data | Medium | Real-time |
| Fine-Tuning | Specialized behavior, domain language | High | Requires retraining |
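To make the first row concrete: when the entire knowledge base fits comfortably in the context window, prompt engineering amounts to skipping retrieval and stuffing the documents straight into the prompt. A minimal sketch of that baseline (the helper name is hypothetical, reusing the same ChatAnthropic client as above):

# Prompt-engineering baseline: no retrieval, the whole corpus goes into the prompt
# Only viable while the corpus fits in the model's context window
from langchain_anthropic import ChatAnthropic

def ask_with_full_context(question: str, docs: list[str]) -> str:
    llm = ChatAnthropic(model="claude-sonnet-4-20250514")
    response = llm.invoke([
        {"role": "system", "content": "Answer based only on these documents:\n\n" + "\n\n".join(docs)},
        {"role": "user", "content": question},
    ])
    return response.content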
Common RAG Mistakes
- Chunks too large: Retrieving 5000-word chunks wastes context window and dilutes relevance (a chunk-audit sketch follows this list)
- Chunks too small: Tiny chunks lose context and produce fragmented retrieval
- No overlap: Without chunk overlap, important information at boundaries gets lost
- Wrong embedding model: Using a general model for a specialized domain produces poor retrieval
- No source attribution: Users need to verify answers — always return the source documents
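The first three mistakes are cheap to catch before indexing by auditing chunk-size statistics. A sketch of such a check (the thresholds and helper name are illustrative, not a library API):

# Hypothetical sanity check: flag suspicious chunk sizes before indexing
def audit_chunks(chunks, min_chars: int = 200, max_chars: int = 2000) -> None:
    sizes = [len(c.page_content) for c in chunks]
    print(f"{len(sizes)} chunks, average {sum(sizes) / len(sizes):.0f} chars")
    too_small = sum(1 for s in sizes if s < min_chars)
    too_large = sum(1 for s in sizes if s > max_chars)
    if too_small:
        print(f"warning: {too_small} chunks under {min_chars} chars (likely fragmented)")
    if too_large:
        print(f"warning: {too_large} chunks over {max_chars} chars (likely diluting retrieval)")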
Summary
RAG is the most practical approach to building knowledge-grounded AI applications. By combining retrieval with generation, you get the reasoning power of LLMs with the accuracy of your own data. The pipeline — load, chunk, embed, store, retrieve, generate — is the foundation that all advanced RAG techniques build upon. Master the basics before exploring the optimizations covered in upcoming lessons.