TechLead
Lesson 8 of 24
6 min read
AI Agents & RAG

RAG Fundamentals

Understand Retrieval-Augmented Generation architecture, why it matters, and how it works end-to-end

What is RAG?

Retrieval-Augmented Generation (RAG) is an architecture pattern that enhances LLM responses by retrieving relevant information from external knowledge sources before generating an answer. Instead of relying solely on the model's training data (which has a knowledge cutoff), RAG dynamically fetches up-to-date, domain-specific information from your own data sources.

RAG was introduced by Lewis et al. in 2020 and has become the standard approach for building knowledge-grounded AI applications. It solves two fundamental LLM limitations: hallucination (making up facts) and knowledge staleness (training data becomes outdated).

Why Use RAG?

  • Reduce Hallucinations: Ground responses in actual documents and data
  • Fresh Knowledge: Access up-to-date information beyond the training cutoff
  • Domain Specificity: Answer questions about your proprietary data, docs, and code
  • Source Attribution: Cite the exact documents that support each answer
  • Cost Effective: Cheaper than fine-tuning for most knowledge-grounding use cases
  • Data Privacy: Keep sensitive data in your own infrastructure, not in model weights

RAG Architecture

The RAG pipeline consists of two main phases: Indexing (offline) and Retrieval + Generation (online).

RAG Pipeline Steps

Phase      Step              Description
Indexing   1. Load           Ingest documents from files, APIs, databases, or web
Indexing   2. Chunk          Split documents into smaller, meaningful pieces
Indexing   3. Embed          Convert chunks into vector embeddings
Indexing   4. Store          Save embeddings in a vector database
Query      5. Embed Query    Convert user question into a vector
Query      6. Retrieve       Find most similar document chunks
Query      7. Generate       Pass retrieved context + question to LLM
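
The query-time half of this table can be sketched without any vector database. The embed function below is a deliberately crude stand-in (bag-of-words term counts) for a real embedding model, so treat it as an illustration only; the cosine-similarity scoring, however, is the same math a vector store performs at retrieval time.

```python
import math
from collections import Counter

# Toy "embedding": bag-of-words term counts. A real system would call an
# embedding model here; this stand-in just makes the similarity math visible.
def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Steps 3-4: "index" the chunks (here, just embed them into a list)
chunks = [
    "Employees receive 20 vacation days per year.",
    "The office is closed on public holidays.",
    "Expense reports are due by the 5th of each month.",
]
index = [(c, embed(c)) for c in chunks]

# Steps 5-6: embed the query and retrieve the most similar chunk
query = "How many vacation days do employees get?"
qvec = embed(query)
best = max(index, key=lambda item: cosine(qvec, item[1]))
print(best[0])  # the vacation-policy chunk scores highest
```

Swapping the toy embedding for a real model changes the quality of the vectors, not the shape of the pipeline: embed, score, take the top-k.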

Building a Basic RAG Pipeline

Let's build a complete RAG system from scratch to understand every component.

// Complete RAG pipeline in TypeScript
import { OpenAIEmbeddings } from "@langchain/openai";
import { ChatAnthropic } from "@langchain/anthropic";
import { ChromaClient } from "chromadb";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import * as fs from "fs";

// Step 1: Load documents
function loadDocuments(directory: string): { content: string; source: string }[] {
  const files = fs.readdirSync(directory);
  return files
    .filter(f => f.endsWith(".txt") || f.endsWith(".md"))
    .map(f => ({
      content: fs.readFileSync(`${directory}/${f}`, "utf-8"),
      source: f,
    }));
}

// Step 2: Chunk documents
async function chunkDocuments(docs: { content: string; source: string }[]) {
  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 200,
    separators: ["\n\n", "\n", ". ", " ", ""],
  });

  const chunks: { content: string; source: string; index: number }[] = [];

  for (const doc of docs) {
    const splits = await splitter.splitText(doc.content);
    splits.forEach((text, i) => {
      chunks.push({ content: text, source: doc.source, index: i });
    });
  }

  return chunks;
}

// Step 3 & 4: Embed and store
async function indexDocuments(chunks: { content: string; source: string; index: number }[]) {
  const embeddings = new OpenAIEmbeddings({ model: "text-embedding-3-small" });
  const chroma = new ChromaClient();

  const collection = await chroma.getOrCreateCollection({ name: "my_docs" });

  // Batch embed and store
  const batchSize = 100;
  for (let i = 0; i < chunks.length; i += batchSize) {
    const batch = chunks.slice(i, i + batchSize);
    const vectors = await embeddings.embedDocuments(batch.map(c => c.content));

    await collection.add({
      ids: batch.map((_, j) => `chunk_${i + j}`),
      embeddings: vectors,
      documents: batch.map(c => c.content),
      metadatas: batch.map(c => ({ source: c.source, index: c.index })),
    });
  }

  return collection;
}

// Steps 5-7: Query pipeline
async function queryRAG(question: string): Promise<string> {
  const embeddings = new OpenAIEmbeddings({ model: "text-embedding-3-small" });
  const chroma = new ChromaClient();
  const collection = await chroma.getCollection({ name: "my_docs" });
  const llm = new ChatAnthropic({ model: "claude-sonnet-4-20250514" });

  // Step 5: Embed the question
  const queryVector = await embeddings.embedQuery(question);

  // Step 6: Retrieve relevant chunks
  const results = await collection.query({
    queryEmbeddings: [queryVector],
    nResults: 5,
  });

  const context = results.documents?.[0]?.join("\n\n") || "";
  const sources = results.metadatas?.[0]?.map((m: any) => m.source) || [];

  // Step 7: Generate answer with context
  const response = await llm.invoke([
    {
      role: "system",
      content: `You are a helpful assistant. Answer questions based ONLY on the provided context.
If the context doesn't contain the answer, say "I don't have enough information to answer that."
Always cite your sources.

Context:
${context}`,
    },
    { role: "user", content: question },
  ]);

  return `${response.content}\n\nSources: ${[...new Set(sources)].join(", ")}`;
}

// Run the full pipeline
const docs = loadDocuments("./knowledge-base");
const chunks = await chunkDocuments(docs);
await indexDocuments(chunks);
const answer = await queryRAG("What is our company's vacation policy?");
console.log(answer);

# Complete RAG pipeline in Python
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_anthropic import ChatAnthropic
from langchain_chroma import Chroma

# Step 1: Load documents
loader = DirectoryLoader(
    "./knowledge-base",
    glob=["**/*.txt", "**/*.md"],  # Path.glob has no brace expansion; list patterns separately
    loader_cls=TextLoader,
)
documents = loader.load()
print(f"Loaded {len(documents)} documents")

# Step 2: Chunk documents
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks")

# Steps 3 & 4: Embed and store in vector database
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",
    collection_name="my_docs",
)

# Steps 5-7: Query pipeline
def query_rag(question: str) -> str:
    # Step 5 & 6: Retrieve relevant chunks
    retriever = vectorstore.as_retriever(
        search_type="similarity",
        search_kwargs={"k": 5},
    )
    relevant_docs = retriever.invoke(question)

    # Build context from retrieved documents
    context = "\n\n".join(doc.page_content for doc in relevant_docs)
    sources = set(doc.metadata.get("source", "unknown") for doc in relevant_docs)

    # Step 7: Generate answer
    llm = ChatAnthropic(model="claude-sonnet-4-20250514")
    response = llm.invoke([
        {"role": "system", "content": f"""Answer based ONLY on the context below.
If the context doesn't contain the answer, say so. Cite your sources.

Context:
{context}"""},
        {"role": "user", "content": question},
    ])

    return f"{response.content}\n\nSources: {', '.join(sources)}"

# Usage
answer = query_rag("What is our company's vacation policy?")
print(answer)
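
Both pipelines end the same way: join the retrieved chunks into a grounded system prompt and deduplicate the sources. That assembly step is plain string work, so it is worth isolating for unit testing. The helper below is hypothetical (not part of LangChain); the chunk dicts assume the same content/source shape stored at indexing time.

```python
def build_grounded_prompt(chunks: list[dict]) -> tuple[str, list[str]]:
    """Join retrieved chunks into a context block and collect unique sources.

    Each chunk is a dict with "content" and "source" keys (an assumed shape
    matching the metadata stored at indexing time).
    """
    context = "\n\n".join(c["content"] for c in chunks)
    # dict.fromkeys deduplicates while preserving first-seen order,
    # unlike a bare set(), so source citations come out in a stable order
    sources = list(dict.fromkeys(c["source"] for c in chunks))
    system = (
        "Answer based ONLY on the context below.\n"
        "If the context doesn't contain the answer, say so. Cite your sources.\n\n"
        f"Context:\n{context}"
    )
    return system, sources

prompt, sources = build_grounded_prompt([
    {"content": "Vacation: 20 days.", "source": "hr.md"},
    {"content": "Holidays are paid.", "source": "hr.md"},
    {"content": "Expenses due monthly.", "source": "finance.md"},
])
print(sources)  # → ['hr.md', 'finance.md']
```

Keeping this step pure (no LLM or database calls) means the grounding prompt and citation list can be tested without network access.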

RAG vs. Fine-Tuning vs. Prompt Engineering

Approach             Best For                                Cost     Data Freshness
Prompt Engineering   Small context, simple tasks             Low      Manual updates
RAG                  Large knowledge bases, dynamic data     Medium   Real-time
Fine-Tuning          Specialized behavior, domain language   High     Requires retraining

Common RAG Mistakes

  • Chunks too large: Retrieving 5000-word chunks wastes context window and dilutes relevance
  • Chunks too small: Tiny chunks lose context and produce fragmented retrieval
  • No overlap: Without chunk overlap, important information at boundaries gets lost
  • Wrong embedding model: Using a general model for a specialized domain produces poor retrieval
  • No source attribution: Users need to verify answers — always return the source documents
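
The overlap point in the list above is easy to see with a few lines of dependency-free code. split_with_overlap is a hypothetical helper written for illustration; real splitters such as RecursiveCharacterTextSplitter also respect separators, but the overlap mechanics are the same.

```python
# Minimal sketch of fixed-size chunking with overlap (illustrative sizes only).
def split_with_overlap(text: str, chunk_size: int, overlap: int) -> list[str]:
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # last chunk already reached the end of the text
    return chunks

text = "The vacation policy grants 20 days per year. Unused days roll over."
no_overlap = split_with_overlap(text, chunk_size=40, overlap=0)
overlapped = split_with_overlap(text, chunk_size=40, overlap=15)

# Without overlap, the sentence spanning the 40-character boundary is cut in
# two; with overlap, each boundary region also appears whole in some chunk.
print(no_overlap)
print(overlapped)
```

The trade-off is duplication: each character of overlap is embedded and stored twice, which is why overlap is typically a modest fraction (10-20%) of the chunk size.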

Summary

RAG is the most practical approach to building knowledge-grounded AI applications. By combining retrieval with generation, you get the reasoning power of LLMs with the accuracy of your own data. The pipeline — load, chunk, embed, store, retrieve, generate — is the foundation that all advanced RAG techniques build upon. Master the basics before exploring the optimizations covered in upcoming lessons.
