TechLead
Lesson 8 of 24
6 min read
AI Agents & RAG

RAG Fundamentals

Understand Retrieval-Augmented Generation architecture, why it matters, and how it works end-to-end

What is RAG?

Retrieval-Augmented Generation (RAG) is an architecture pattern that enhances LLM responses by retrieving relevant information from external knowledge sources before generating an answer. Instead of relying solely on the model's training data (which has a knowledge cutoff), RAG dynamically fetches up-to-date, domain-specific information from your own data sources.

RAG was introduced by Lewis et al. in 2020 and has become the standard approach for building knowledge-grounded AI applications. It solves two fundamental LLM limitations: hallucination (making up facts) and knowledge staleness (training data becomes outdated).

Why Use RAG?

  • Reduce Hallucinations: Ground responses in actual documents and data
  • Fresh Knowledge: Access up-to-date information beyond the training cutoff
  • Domain Specificity: Answer questions about your proprietary data, docs, and code
  • Source Attribution: Cite the exact documents that support each answer
  • Cost Effective: Cheaper than fine-tuning for most knowledge-grounding use cases
  • Data Privacy: Keep sensitive data in your own infrastructure, not in model weights

RAG Architecture

The RAG pipeline consists of two main phases: Indexing (offline) and Retrieval + Generation (online).

RAG Pipeline Steps

Phase      Step              Description
Indexing   1. Load           Ingest documents from files, APIs, databases, or web
Indexing   2. Chunk          Split documents into smaller, meaningful pieces
Indexing   3. Embed          Convert chunks into vector embeddings
Indexing   4. Store          Save embeddings in a vector database
Query      5. Embed Query    Convert user question into a vector
Query      6. Retrieve       Find most similar document chunks
Query      7. Generate       Pass retrieved context + question to LLM
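
The query-time half of this table can be sketched without any vector database. The embed function below is a deliberately crude stand-in (bag-of-words term counts) for a real embedding model, so treat it as an illustration only; the cosine-similarity scoring, however, is the same math a vector store performs at retrieval time.

```python
import math
from collections import Counter

# Toy "embedding": bag-of-words term counts. A real system would call an
# embedding model here; this stand-in just makes the similarity math visible.
def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Steps 3-4: "index" the chunks (here, just embed them into a list)
chunks = [
    "Employees receive 20 vacation days per year.",
    "The office is closed on public holidays.",
    "Expense reports are due by the 5th of each month.",
]
index = [(c, embed(c)) for c in chunks]

# Steps 5-6: embed the query and retrieve the most similar chunk
query = "How many vacation days do employees get?"
qvec = embed(query)
best = max(index, key=lambda item: cosine(qvec, item[1]))
print(best[0])  # the vacation-policy chunk scores highest
```

Swapping the toy embedding for a real model changes the quality of the vectors, not the shape of the pipeline: embed, score, take the top-k.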

Building a Basic RAG Pipeline

Let's build a complete RAG system from scratch to understand every component.

// Complete RAG pipeline in TypeScript
import { OpenAIEmbeddings } from "@langchain/openai";
import { ChatAnthropic } from "@langchain/anthropic";
import { ChromaClient } from "chromadb";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import * as fs from "fs";

// Step 1: Load documents
function loadDocuments(directory: string): { content: string; source: string }[] {
  const files = fs.readdirSync(directory);
  return files
    .filter(f => f.endsWith(".txt") || f.endsWith(".md"))
    .map(f => ({
      content: fs.readFileSync(`${directory}/${f}`, "utf-8"),
      source: f,
    }));
}

// Step 2: Chunk documents
async function chunkDocuments(docs: { content: string; source: string }[]) {
  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 200,
    separators: ["\n\n", "\n", ". ", " ", ""],
  });

  const chunks: { content: string; source: string; index: number }[] = [];

  for (const doc of docs) {
    const splits = await splitter.splitText(doc.content);
    splits.forEach((text, i) => {
      chunks.push({ content: text, source: doc.source, index: i });
    });
  }

  return chunks;
}

// Step 3 & 4: Embed and store
async function indexDocuments(chunks: { content: string; source: string; index: number }[]) {
  const embeddings = new OpenAIEmbeddings({ model: "text-embedding-3-small" });
  const chroma = new ChromaClient();

  const collection = await chroma.getOrCreateCollection({ name: "my_docs" });

  // Batch embed and store
  const batchSize = 100;
  for (let i = 0; i < chunks.length; i += batchSize) {
    const batch = chunks.slice(i, i + batchSize);
    const vectors = await embeddings.embedDocuments(batch.map(c => c.content));

    await collection.add({
      ids: batch.map((_, j) => `chunk_${i + j}`),
      embeddings: vectors,
      documents: batch.map(c => c.content),
      metadatas: batch.map(c => ({ source: c.source, index: c.index })),
    });
  }

  return collection;
}

// Steps 5-7: Query pipeline
async function queryRAG(question: string): Promise<string> {
  const embeddings = new OpenAIEmbeddings({ model: "text-embedding-3-small" });
  const chroma = new ChromaClient();
  const collection = await chroma.getCollection({ name: "my_docs" });
  const llm = new ChatAnthropic({ model: "claude-sonnet-4-20250514" });

  // Step 5: Embed the question
  const queryVector = await embeddings.embedQuery(question);

  // Step 6: Retrieve relevant chunks
  const results = await collection.query({
    queryEmbeddings: [queryVector],
    nResults: 5,
  });

  const context = results.documents?.[0]?.join("\n\n") || "";
  const sources = results.metadatas?.[0]?.map((m: any) => m.source) || [];

  // Step 7: Generate answer with context
  const response = await llm.invoke([
    {
      role: "system",
      content: `You are a helpful assistant. Answer questions based ONLY on the provided context.
If the context doesn't contain the answer, say "I don't have enough information to answer that."
Always cite your sources.

Context:
${context}`,
    },
    { role: "user", content: question },
  ]);

  return `${response.content}\n\nSources: ${[...new Set(sources)].join(", ")}`;
}

// Run the full pipeline
const docs = loadDocuments("./knowledge-base");
const chunks = await chunkDocuments(docs);
await indexDocuments(chunks);
const answer = await queryRAG("What is our company's vacation policy?");
console.log(answer);

# Complete RAG pipeline in Python
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_anthropic import ChatAnthropic
from langchain_chroma import Chroma

# Step 1: Load documents
loader = DirectoryLoader(
    "./knowledge-base",
    glob=["**/*.txt", "**/*.md"],  # Path.glob has no brace expansion; list patterns separately
    loader_cls=TextLoader,
)
documents = loader.load()
print(f"Loaded {len(documents)} documents")

# Step 2: Chunk documents
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks")

# Steps 3 & 4: Embed and store in vector database
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",
    collection_name="my_docs",
)

# Steps 5-7: Query pipeline
def query_rag(question: str) -> str:
    # Step 5 & 6: Retrieve relevant chunks
    retriever = vectorstore.as_retriever(
        search_type="similarity",
        search_kwargs={"k": 5},
    )
    relevant_docs = retriever.invoke(question)

    # Build context from retrieved documents
    context = "\n\n".join(doc.page_content for doc in relevant_docs)
    sources = set(doc.metadata.get("source", "unknown") for doc in relevant_docs)

    # Step 7: Generate answer
    llm = ChatAnthropic(model="claude-sonnet-4-20250514")
    response = llm.invoke([
        {"role": "system", "content": f"""Answer based ONLY on the context below.
If the context doesn't contain the answer, say so. Cite your sources.

Context:
{context}"""},
        {"role": "user", "content": question},
    ])

    return f"{response.content}\n\nSources: {', '.join(sources)}"

# Usage
answer = query_rag("What is our company's vacation policy?")
print(answer)
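
Both pipelines end the same way: join the retrieved chunks into a grounded system prompt and deduplicate the sources. That assembly step is plain string work, so it is worth isolating for unit testing. The helper below is hypothetical (not part of LangChain); the chunk dicts assume the same content/source shape stored at indexing time.

```python
def build_grounded_prompt(chunks: list[dict]) -> tuple[str, list[str]]:
    """Join retrieved chunks into a context block and collect unique sources.

    Each chunk is a dict with "content" and "source" keys (an assumed shape
    matching the metadata stored at indexing time).
    """
    context = "\n\n".join(c["content"] for c in chunks)
    # dict.fromkeys deduplicates while preserving first-seen order,
    # unlike a bare set(), so source citations come out in a stable order
    sources = list(dict.fromkeys(c["source"] for c in chunks))
    system = (
        "Answer based ONLY on the context below.\n"
        "If the context doesn't contain the answer, say so. Cite your sources.\n\n"
        f"Context:\n{context}"
    )
    return system, sources

prompt, sources = build_grounded_prompt([
    {"content": "Vacation: 20 days.", "source": "hr.md"},
    {"content": "Holidays are paid.", "source": "hr.md"},
    {"content": "Expenses due monthly.", "source": "finance.md"},
])
print(sources)  # → ['hr.md', 'finance.md']
```

Keeping this step pure (no LLM or database calls) means the grounding prompt and citation list can be tested without network access.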

RAG vs. Fine-Tuning vs. Prompt Engineering

Approach             Best For                                Cost     Data Freshness
Prompt Engineering   Small context, simple tasks             Low      Manual updates
RAG                  Large knowledge bases, dynamic data     Medium   Real-time
Fine-Tuning          Specialized behavior, domain language   High     Requires retraining

Common RAG Mistakes

  • Chunks too large: Retrieving 5000-word chunks wastes context window and dilutes relevance
  • Chunks too small: Tiny chunks lose context and produce fragmented retrieval
  • No overlap: Without chunk overlap, important information at boundaries gets lost
  • Wrong embedding model: Using a general model for a specialized domain produces poor retrieval
  • No source attribution: Users need to verify answers — always return the source documents
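
The overlap point in the list above is easy to see with a few lines of dependency-free code. split_with_overlap is a hypothetical helper written for illustration; real splitters such as RecursiveCharacterTextSplitter also respect separators, but the overlap mechanics are the same.

```python
# Minimal sketch of fixed-size chunking with overlap (illustrative sizes only).
def split_with_overlap(text: str, chunk_size: int, overlap: int) -> list[str]:
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # last chunk already reached the end of the text
    return chunks

text = "The vacation policy grants 20 days per year. Unused days roll over."
no_overlap = split_with_overlap(text, chunk_size=40, overlap=0)
overlapped = split_with_overlap(text, chunk_size=40, overlap=15)

# Without overlap, the sentence spanning the 40-character boundary is cut in
# two; with overlap, each boundary region also appears whole in some chunk.
print(no_overlap)
print(overlapped)
```

The trade-off is duplication: each character of overlap is embedded and stored twice, which is why overlap is typically a modest fraction (10-20%) of the chunk size.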

Summary

RAG is the most practical approach to building knowledge-grounded AI applications. By combining retrieval with generation, you get the reasoning power of LLMs with the accuracy of your own data. The pipeline — load, chunk, embed, store, retrieve, generate — is the foundation that all advanced RAG techniques build upon. Master the basics before exploring the optimizations covered in upcoming lessons.
