Why Chunking Matters
Chunking is the process of splitting documents into smaller pieces for embedding and retrieval. It's arguably the most impactful factor in RAG quality. Poor chunking leads to poor retrieval, and poor retrieval leads to irrelevant or incorrect answers — no matter how good your LLM is.
The goal is to create chunks that are semantically coherent (each chunk contains a complete thought or topic), appropriately sized (large enough for context, small enough for precision), and well-overlapped (important information at boundaries isn't lost).
Chunking Strategy Quick Guide
- Fixed-Size: Simple, predictable — good baseline for homogeneous documents
- Recursive Character: Splits on natural boundaries (paragraphs, sentences) — best general-purpose
- Semantic: Uses embeddings to find natural topic boundaries — best quality, higher cost
- Parent-Child: Small chunks for retrieval, large chunks for context — best of both worlds
- Document-Specific: Markdown headers, code functions, HTML tags — best for structured docs
1. Fixed-Size Chunking
The simplest approach — split text into chunks of a fixed number of characters or tokens, with optional overlap.
// Fixed-size chunking
function fixedSizeChunk(
text: string,
chunkSize: number = 1000,
overlap: number = 200
): string[] {
const chunks: string[] = [];
let start = 0;
  while (start < text.length) {
    const end = Math.min(start + chunkSize, text.length);
    chunks.push(text.slice(start, end));
    if (end === text.length) break; // avoid a trailing chunk that is pure overlap
    start += chunkSize - overlap;
  }
return chunks;
}
// Token-based chunking (more accurate for LLMs)
import { encoding_for_model } from "tiktoken";
function tokenBasedChunk(
text: string,
maxTokens: number = 512,
overlapTokens: number = 50
): string[] {
const encoder = encoding_for_model("gpt-4");
const tokens = encoder.encode(text);
const chunks: string[] = [];
let start = 0;
  while (start < tokens.length) {
    const end = Math.min(start + maxTokens, tokens.length);
    const chunkTokens = tokens.slice(start, end);
    // decode() returns UTF-8 bytes; TextDecoder turns them back into a string
    chunks.push(new TextDecoder().decode(encoder.decode(chunkTokens)));
    if (end === tokens.length) break; // avoid a trailing chunk that is pure overlap
    start += maxTokens - overlapTokens;
  }
encoder.free();
return chunks;
}
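A quick sanity check of the character-based splitter; the sample string is synthetic so the chunk counts are easy to verify by hand:
// Usage sketch: 3,000 synthetic characters
const sample = "x".repeat(3000);
console.log(fixedSizeChunk(sample, 1000, 0).length);   // 3 chunks (no overlap)
console.log(fixedSizeChunk(sample, 1000, 200).length); // 4 chunks (step is 800, not 1000)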
2. Recursive Character Splitting
The most popular general-purpose strategy. It tries to split on natural boundaries (paragraphs, then sentences, then words) while respecting a maximum chunk size.
# Recursive character splitting with LangChain
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Standard configuration for most documents
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000, # Target chunk size in characters
chunk_overlap=200, # Overlap between consecutive chunks
length_function=len, # How to measure chunk length
separators=[
"\n\n", # First try: split on double newlines (paragraphs)
"\n", # Then: single newlines
". ", # Then: sentences
", ", # Then: clauses
" ", # Then: words
"", # Last resort: characters
],
is_separator_regex=False,
)
text = """
Machine Learning Fundamentals
Machine learning is a subset of artificial intelligence that enables
systems to learn and improve from experience. It focuses on developing
algorithms that can access data and use it to learn for themselves.
Types of Machine Learning
Supervised learning uses labeled datasets to train algorithms to classify
data or predict outcomes. The model learns by comparing its output with
the correct answers during training.
Unsupervised learning finds hidden patterns in data without labeled
responses. Clustering and dimensionality reduction are common techniques.
"""
chunks = splitter.split_text(text)
for i, chunk in enumerate(chunks):
print(f"Chunk {i} ({len(chunk)} chars): {chunk[:80]}...")
3. Semantic Chunking
Semantic chunking uses embeddings to detect topic boundaries. Adjacent sentences with similar embeddings stay together; when similarity drops significantly, a new chunk begins.
# Semantic chunking using embeddings
from sentence_transformers import SentenceTransformer
import numpy as np
def semantic_chunk(
text: str,
threshold: float = 0.5,
min_chunk_size: int = 100,
max_chunk_size: int = 2000,
) -> list[str]:
"""Split text into semantically coherent chunks."""
model = SentenceTransformer("all-MiniLM-L6-v2")
    # Split into sentences (naive period split; nltk or spaCy would be more robust)
    sentences = [s.strip() for s in text.split(". ") if s.strip()]
if len(sentences) <= 1:
return [text]
# Embed all sentences
embeddings = model.encode(sentences)
# Calculate cosine similarity between consecutive sentences
similarities = []
for i in range(len(embeddings) - 1):
sim = np.dot(embeddings[i], embeddings[i + 1]) / (
np.linalg.norm(embeddings[i]) * np.linalg.norm(embeddings[i + 1])
)
similarities.append(sim)
# Find split points where similarity drops below threshold
chunks = []
current_chunk = [sentences[0]]
for i, sim in enumerate(similarities):
current_text = ". ".join(current_chunk)
if sim < threshold and len(current_text) >= min_chunk_size:
chunks.append(current_text + ".")
current_chunk = [sentences[i + 1]]
elif len(current_text) >= max_chunk_size:
chunks.append(current_text + ".")
current_chunk = [sentences[i + 1]]
else:
current_chunk.append(sentences[i + 1])
    if current_chunk:
        tail = ". ".join(current_chunk)
        # The final sentence keeps its original period, so avoid doubling it
        chunks.append(tail if tail.endswith(".") else tail + ".")
return chunks
# LangChain also provides SemanticChunker
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
semantic_splitter = SemanticChunker(
OpenAIEmbeddings(model="text-embedding-3-small"),
breakpoint_threshold_type="percentile",
breakpoint_threshold_amount=90,
)
chunks = semantic_splitter.split_text(text)
for chunk in chunks:
print(f"[{len(chunk)} chars] {chunk[:100]}...")
4. Parent-Child (Hierarchical) Chunking
This strategy creates small chunks for precise retrieval but returns the larger parent chunk (or full document section) for context. You get the precision of small chunks with the context of large ones.
// Parent-child chunking strategy
interface ParentChunk {
id: string;
content: string;
children: ChildChunk[];
}
interface ChildChunk {
id: string;
content: string;
parentId: string;
}
function parentChildChunk(
text: string,
parentSize: number = 2000,
childSize: number = 400,
childOverlap: number = 50
): { parents: ParentChunk[]; children: ChildChunk[] } {
const parents: ParentChunk[] = [];
const children: ChildChunk[] = [];
// Create parent chunks (large)
  const parentTexts = fixedSizeChunk(text, parentSize, 200); // reuses fixedSizeChunk from section 1
parentTexts.forEach((parentText, pi) => {
const parentId = `parent_${pi}`;
const parent: ParentChunk = {
id: parentId,
content: parentText,
children: [],
};
// Create child chunks (small) within each parent
    const childTexts = fixedSizeChunk(parentText, childSize, childOverlap);
childTexts.forEach((childText, ci) => {
const child: ChildChunk = {
id: `child_${pi}_${ci}`,
content: childText,
parentId,
};
parent.children.push(child);
children.push(child);
});
parents.push(parent);
});
return { parents, children };
}
// During retrieval: search children, return parents
async function retrieveWithParentContext(
query: string,
childCollection: any,
parentMap: Map<string, string>,
k: number = 3
): Promise<string[]> {
// Search over small child chunks (precise matching)
const results = await childCollection.query({
queryTexts: [query],
nResults: k,
});
// Return the larger parent chunks (rich context)
const parentIds = new Set(
results.metadatas[0].map((m: any) => m.parentId)
);
return [...parentIds].map(id => parentMap.get(id) || "");
}
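To wire this together end to end: chunk once, index only the children, and keep the parents in a plain lookup. A hedged sketch, assuming a Chroma-style collection (the add() shape below follows Chroma's JS client; any vector store that supports metadata works the same way):
// Hypothetical end-to-end wiring; indexing details vary by vector store
async function indexAndRetrieve(documentText: string, childCollection: any) {
  const { parents, children } = parentChildChunk(documentText);
  const parentMap = new Map<string, string>();
  for (const p of parents) parentMap.set(p.id, p.content);
  // Embed and store only the small child chunks, each tagged with its parent
  await childCollection.add({
    ids: children.map(c => c.id),
    documents: children.map(c => c.content),
    metadatas: children.map(c => ({ parentId: c.parentId })),
  });
  return retrieveWithParentContext("example query", childCollection, parentMap);
}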
5. Document-Specific Chunking
Structured documents carry their own natural boundaries: markdown headers, function and class definitions in code, HTML tags. Splitting along those boundaries keeps each chunk aligned with the author's organization, and the boundary labels (like header paths) make useful metadata.
# Markdown header-based chunking
from langchain.text_splitter import MarkdownHeaderTextSplitter
markdown_text = """
# Introduction
This is the introduction section.
## Background
Here is some background information with important context.
## Methods
### Data Collection
We collected data from multiple sources.
### Analysis
Statistical analysis was performed using Python.
# Results
The results show significant improvement.
"""
headers_to_split = [
("#", "h1"),
("##", "h2"),
("###", "h3"),
]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split)
chunks = splitter.split_text(markdown_text)
for chunk in chunks:
print(f"Headers: {chunk.metadata} -> {chunk.page_content[:60]}...")
# Code-aware chunking
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter
python_splitter = RecursiveCharacterTextSplitter.from_language(
language=Language.PYTHON,
chunk_size=1000,
chunk_overlap=100,
)
js_splitter = RecursiveCharacterTextSplitter.from_language(
language=Language.JS,
chunk_size=1000,
chunk_overlap=100,
)
Choosing Chunk Size
There is no single right chunk size; the ranges below are starting points to tune against your own documents and queries.
| Document Type | Chunk Size (chars) | Overlap (chars) | Strategy |
|---|---|---|---|
| General text | 500-1000 | 100-200 | Recursive |
| Technical docs | 1000-1500 | 200-300 | Markdown-aware |
| Legal/medical | 800-1200 | 200 | Semantic |
| Code files | 1000-2000 | 100 | Language-aware |
| Q&A / FAQ | 200-500 | 0-50 | Fixed or by item |
Chunking Best Practices
- Always include metadata: Store source file, page number, section header, and chunk index with each chunk
- Test with real queries: The best chunk size depends on your actual questions — experiment and measure
- Use overlap: 10-20% overlap prevents losing information at chunk boundaries
- Preserve structure: Don't split in the middle of tables, code blocks, or lists
- Consider augmenting chunks: Add a summary or title to each chunk for better embedding quality (a minimal sketch follows this list)
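A minimal sketch of the metadata and augmentation practices above. The field names here are illustrative, not a standard schema:
// Hypothetical chunk enrichment: store metadata and prepend the section
// title so the embedding carries topical context
interface EnrichedChunk {
  content: string;     // the text that gets embedded
  source: string;      // source file or URL
  section: string;     // nearest section header
  chunkIndex: number;  // position within the document
}

function enrichChunk(
  rawChunk: string,
  source: string,
  section: string,
  chunkIndex: number
): EnrichedChunk {
  return {
    content: `${section}\n\n${rawChunk}`, // title-augmented embedding text
    source,
    section,
    chunkIndex,
  };
}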
Summary
Chunking is the foundation of RAG quality. Start with recursive character splitting as your baseline, measure retrieval quality, then try semantic or parent-child chunking if needed. The right strategy depends on your document types, query patterns, and quality requirements. Always validate with real queries and iterate.