TechLead
Lesson 9 of 24
AI Agents & RAG

Chunking Strategies

Master fixed-size, semantic, recursive, and parent-child chunking strategies for optimal RAG retrieval

Why Chunking Matters

Chunking is the process of splitting documents into smaller pieces for embedding and retrieval. It's arguably the most impactful factor in RAG quality. Poor chunking leads to poor retrieval, and poor retrieval leads to irrelevant or incorrect answers — no matter how good your LLM is.

The goal is to create chunks that are semantically coherent (each chunk contains a complete thought or topic), appropriately sized (large enough for context, small enough for precision), and well-overlapped (important information at boundaries isn't lost).

Chunking Strategy Quick Guide

  • Fixed-Size: Simple, predictable — good baseline for homogeneous documents
  • Recursive Character: Splits on natural boundaries (paragraphs, sentences) — best general-purpose
  • Semantic: Uses embeddings to find natural topic boundaries — best quality, higher cost
  • Parent-Child: Small chunks for retrieval, large chunks for context — best of both worlds
  • Document-Specific: Markdown headers, code functions, HTML tags — best for structured docs

1. Fixed-Size Chunking

The simplest approach — split text into chunks of a fixed number of characters or tokens, with optional overlap.

// Fixed-size chunking
function fixedSizeChunk(
  text: string,
  chunkSize: number = 1000,
  overlap: number = 200
): string[] {
  const chunks: string[] = [];
  let start = 0;

  while (start < text.length) {
    const end = Math.min(start + chunkSize, text.length);
    chunks.push(text.slice(start, end));
    if (end === text.length) break; // stop here, or the overlap step emits a redundant tail chunk
    start += chunkSize - overlap;
  }

  return chunks;
}

// Token-based chunking (more accurate for LLMs)
import { encoding_for_model } from "tiktoken";

function tokenBasedChunk(
  text: string,
  maxTokens: number = 512,
  overlapTokens: number = 50
): string[] {
  const encoder = encoding_for_model("gpt-4");
  const tokens = encoder.encode(text);
  const chunks: string[] = [];
  let start = 0;

  while (start < tokens.length) {
    const end = Math.min(start + maxTokens, tokens.length);
    const chunkTokens = tokens.slice(start, end);
    chunks.push(new TextDecoder().decode(encoder.decode(chunkTokens)));
    if (end === tokens.length) break; // stop here, or the overlap step emits a redundant tail chunk
    start += maxTokens - overlapTokens;
  }

  encoder.free();
  return chunks;
}

2. Recursive Character Splitting

The most popular general-purpose strategy. It tries to split on natural boundaries (paragraphs, then sentences, then words) while respecting a maximum chunk size.

# Recursive character splitting with LangChain
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Standard configuration for most documents
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,       # Target chunk size in characters
    chunk_overlap=200,     # Overlap between consecutive chunks
    length_function=len,   # How to measure chunk length
    separators=[
        "\n\n",    # First try: split on double newlines (paragraphs)
        "\n",      # Then: single newlines
        ". ",      # Then: sentences
        ", ",      # Then: clauses
        " ",       # Then: words
        "",        # Last resort: characters
    ],
    is_separator_regex=False,
)

text = """
Machine Learning Fundamentals

Machine learning is a subset of artificial intelligence that enables
systems to learn and improve from experience. It focuses on developing
algorithms that can access data and use it to learn for themselves.

Types of Machine Learning

Supervised learning uses labeled datasets to train algorithms to classify
data or predict outcomes. The model learns by comparing its output with
the correct answers during training.

Unsupervised learning finds hidden patterns in data without labeled
responses. Clustering and dimensionality reduction are common techniques.
"""

chunks = splitter.split_text(text)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i} ({len(chunk)} chars): {chunk[:80]}...")

3. Semantic Chunking

Semantic chunking uses embeddings to detect topic boundaries. Adjacent sentences with similar embeddings stay together; when similarity drops significantly, a new chunk begins.

# Semantic chunking using embeddings
from sentence_transformers import SentenceTransformer
import numpy as np

def semantic_chunk(
    text: str,
    threshold: float = 0.5,
    min_chunk_size: int = 100,
    max_chunk_size: int = 2000,
) -> list[str]:
    """Split text into semantically coherent chunks."""
    model = SentenceTransformer("all-MiniLM-L6-v2")

    # Split into sentences (naive: assumes ". " delimits every sentence)
    sentences = [s.strip() for s in text.split(". ") if s.strip()]

    if len(sentences) <= 1:
        return [text]

    # Embed all sentences
    embeddings = model.encode(sentences)

    # Calculate cosine similarity between consecutive sentences
    similarities = []
    for i in range(len(embeddings) - 1):
        sim = np.dot(embeddings[i], embeddings[i + 1]) / (
            np.linalg.norm(embeddings[i]) * np.linalg.norm(embeddings[i + 1])
        )
        similarities.append(sim)

    # Find split points where similarity drops below threshold
    chunks = []
    current_chunk = [sentences[0]]

    for i, sim in enumerate(similarities):
        current_text = ". ".join(current_chunk)

        if sim < threshold and len(current_text) >= min_chunk_size:
            chunks.append(current_text + ".")
            current_chunk = [sentences[i + 1]]
        elif len(current_text) >= max_chunk_size:
            chunks.append(current_text + ".")
            current_chunk = [sentences[i + 1]]
        else:
            current_chunk.append(sentences[i + 1])

    if current_chunk:
        chunks.append(". ".join(current_chunk) + ".")

    return chunks

# LangChain also provides SemanticChunker
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

semantic_splitter = SemanticChunker(
    OpenAIEmbeddings(model="text-embedding-3-small"),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=90,
)

chunks = semantic_splitter.split_text(text)
for chunk in chunks:
    print(f"[{len(chunk)} chars] {chunk[:100]}...")

4. Parent-Child (Hierarchical) Chunking

This strategy creates small chunks for precise retrieval but returns the larger parent chunk (or full document section) for context. You get the precision of small chunks with the context of large ones.

// Parent-child chunking strategy
interface ParentChunk {
  id: string;
  content: string;
  children: ChildChunk[];
}

interface ChildChunk {
  id: string;
  content: string;
  parentId: string;
}

function parentChildChunk(
  text: string,
  parentSize: number = 2000,
  childSize: number = 400,
  childOverlap: number = 50
): { parents: ParentChunk[]; children: ChildChunk[] } {
  const parents: ParentChunk[] = [];
  const children: ChildChunk[] = [];

  // Create parent chunks (large); splitBySize is the fixedSizeChunk
  // helper from section 1: (text, size, overlap)
  const parentTexts = splitBySize(text, parentSize, 200);

  parentTexts.forEach((parentText, pi) => {
    const parentId = `parent_${pi}`;
    const parent: ParentChunk = {
      id: parentId,
      content: parentText,
      children: [],
    };

    // Create child chunks (small) within each parent
    const childTexts = splitBySize(parentText, childSize, childOverlap);
    childTexts.forEach((childText, ci) => {
      const child: ChildChunk = {
        id: `child_${pi}_${ci}`,
        content: childText,
        parentId,
      };
      parent.children.push(child);
      children.push(child);
    });

    parents.push(parent);
  });

  return { parents, children };
}

// During retrieval: search children, return parents
async function retrieveWithParentContext(
  query: string,
  childCollection: any,
  parentMap: Map<string, string>,
  k: number = 3
): Promise<string[]> {
  // Search over small child chunks (precise matching)
  const results = await childCollection.query({
    queryTexts: [query],
    nResults: k,
  });

  // Return the larger parent chunks (rich context)
  const parentIds = new Set(
    results.metadatas[0].map((m: any) => m.parentId)
  );

  return [...parentIds].map(id => parentMap.get(id) || "");
}
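The retrieval-time mapping can also be sketched without a vector store. In the Python sketch below, `child_hits` stands in for whatever your search layer returns and `parent_map` is the id-to-text lookup built at indexing time; both names are illustrative, not a real API.

```python
# Sketch: map retrieved child chunks back to their parent chunks.
# `child_hits` and `parent_map` are illustrative stand-ins, not a real API.

def parents_for_hits(child_hits: list[dict], parent_map: dict[str, str]) -> list[str]:
    """Deduplicate parent ids from child search hits, preserving hit order."""
    seen: set[str] = set()
    parents: list[str] = []
    for hit in child_hits:
        pid = hit["parentId"]
        if pid not in seen:
            seen.add(pid)
            parents.append(parent_map[pid])
    return parents

# Toy data: two retrieved children share parent_0
parent_map = {"parent_0": "Full section A ...", "parent_1": "Full section B ..."}
child_hits = [
    {"id": "child_0_0", "parentId": "parent_0"},
    {"id": "child_1_0", "parentId": "parent_1"},
    {"id": "child_0_1", "parentId": "parent_0"},
]

print(parents_for_hits(child_hits, parent_map))
# ['Full section A ...', 'Full section B ...']
```

Deduplicating parent ids matters in practice: several top-k children often come from the same parent, and returning that parent once keeps the context window free for other sections.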

5. Document-Specific Chunking

Structured documents carry their own natural boundaries: Markdown headers, code functions and classes, HTML tags. Splitting along that structure keeps each chunk a coherent unit and lets you attach section metadata for free.

# Markdown header-based chunking
from langchain.text_splitter import MarkdownHeaderTextSplitter

markdown_text = """
# Introduction
This is the introduction section.

## Background
Here is some background information with important context.

## Methods
### Data Collection
We collected data from multiple sources.

### Analysis
Statistical analysis was performed using Python.

# Results
The results show significant improvement.
"""

headers_to_split = [
    ("#", "h1"),
    ("##", "h2"),
    ("###", "h3"),
]

splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split)
chunks = splitter.split_text(markdown_text)

for chunk in chunks:
    print(f"Headers: {chunk.metadata} -> {chunk.page_content[:60]}...")


# Code-aware chunking
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=1000,
    chunk_overlap=100,
)

js_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.JS,
    chunk_size=1000,
    chunk_overlap=100,
)
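Under the hood, from_language mostly swaps in language-specific separators, such as class and function boundaries for Python. The dependency-free sketch below illustrates that idea with a single regex; it is a simplified approximation, not LangChain's actual separator list.

```python
import re

# Toy sketch of language-aware splitting: prefer top-level function/class
# boundaries. Simplified approximation, not LangChain's actual separators.
def split_python_source(source: str) -> list[str]:
    # Split just before lines that start a top-level "def " or "class "
    parts = re.split(r"(?=^(?:def |class ))", source, flags=re.MULTILINE)
    return [p.strip() for p in parts if p.strip()]

code = '''\
import math

def area(r):
    return math.pi * r ** 2

class Circle:
    def __init__(self, r):
        self.r = r
'''

for chunk in split_python_source(code):
    print("---")
    print(chunk)
```

Note that the indented `def __init__` is not a split point, since the pattern only matches at column zero; that is exactly why language-aware splitting keeps a class together with its methods.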

Choosing Chunk Size

Chunk Size Guidelines

Document Type   | Chunk Size      | Overlap | Strategy
General text    | 500-1000 chars  | 100-200 | Recursive
Technical docs  | 1000-1500 chars | 200-300 | Markdown-aware
Legal/medical   | 800-1200 chars  | 200     | Semantic
Code files      | 1000-2000 chars | 100     | Language-aware
Q&A / FAQ       | 200-500 chars   | 0-50    | Fixed or by item
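These guidelines can be encoded as starting-point configs. The numbers below just transcribe the table (taking the upper end of each range) and the type names are arbitrary labels; treat all of it as defaults to tune, not fixed rules.

```python
# Starting-point chunking configs transcribed from the guidelines table.
# Values are defaults to tune against real queries, not fixed rules.
CHUNK_CONFIGS = {
    "general":   {"chunk_size": 1000, "overlap": 200, "strategy": "recursive"},
    "technical": {"chunk_size": 1500, "overlap": 300, "strategy": "markdown"},
    "legal":     {"chunk_size": 1200, "overlap": 200, "strategy": "semantic"},
    "code":      {"chunk_size": 2000, "overlap": 100, "strategy": "language-aware"},
    "faq":       {"chunk_size": 500,  "overlap": 50,  "strategy": "fixed"},
}

def config_for(doc_type: str) -> dict:
    """Fall back to the general-purpose config for unknown document types."""
    return CHUNK_CONFIGS.get(doc_type, CHUNK_CONFIGS["general"])

print(config_for("code")["strategy"])   # language-aware
```

Falling back to the general-purpose recursive config is a reasonable default when a pipeline ingests document types it has never seen.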

Chunking Best Practices

  • Always include metadata: Store source file, page number, section header, and chunk index with each chunk
  • Test with real queries: The best chunk size depends on your actual questions — experiment and measure
  • Use overlap: 10-20% overlap prevents losing information at chunk boundaries
  • Preserve structure: Don't split in the middle of tables, code blocks, or lists
  • Consider augmenting chunks: Add a summary or title to each chunk for better embedding quality
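The metadata practice can be sketched with plain dicts. The field names below (source, section, chunk_index) are illustrative, not a required schema; use whatever your vector store expects.

```python
# Sketch: wrap each chunk with retrieval metadata.
# Field names are illustrative, not a required schema.
def with_metadata(chunks: list[str], source: str, section: str) -> list[dict]:
    return [
        {
            "text": chunk,
            "source": source,        # originating file
            "section": section,      # nearest heading
            "chunk_index": i,        # position within the document
        }
        for i, chunk in enumerate(chunks)
    ]

records = with_metadata(["Intro text...", "Details text..."], "guide.md", "Introduction")
print(records[0]["source"], records[1]["chunk_index"])  # guide.md 1
```

Carrying source and chunk_index through indexing pays off later: citations can point at the exact file and position, and adjacent chunks can be fetched for extra context.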

Summary

Chunking is the foundation of RAG quality. Start with recursive character splitting as your baseline, measure retrieval quality, then try semantic or parent-child chunking if needed. The right strategy depends on your document types, query patterns, and quality requirements. Always validate with real queries and iterate.
