Ch 3 — Chunking Strategies

Breaking documents into retrieval-friendly pieces
Why Chunking Matters
Chunks are the unit of retrieval — get them wrong and everything downstream fails
The Core Problem
Documents are too long to embed as a single vector. A 50-page PDF can't fit in a single embedding. Even if it could, a single vector for 50 pages would be too vague to match specific questions. Chunking splits documents into smaller pieces that each capture a focused topic.
Why Size Matters
Too large: Chunks contain multiple topics. When retrieved, the LLM gets noise alongside the answer — diluting relevance and wasting context window tokens.

Too small: Chunks lose context. A sentence like "The company exceeded targets" means nothing without knowing which company and which targets.
Too Large (5000 tokens)
Contains the answer but buried in 4 unrelated paragraphs. LLM may miss it or get confused by contradicting info in the same chunk.
Right Size (300-500 tokens)
Focused on one topic. High relevance score when matched. LLM gets clean context with minimal noise.
Too Small (50 tokens)
Lost all context. "Revenue was $4.2M" — for which quarter? Which division? The LLM can't answer accurately.
Chunking is the highest-leverage optimization in RAG. Before tuning embeddings, rerankers, or prompts — get your chunks right. Bad chunks make everything else irrelevant.
Fixed-Size Chunking
The simplest approach: split by character or token count
How It Works
Split the document into chunks of a fixed size (e.g., 500 characters or 256 tokens). Add overlap between consecutive chunks so that sentences at chunk boundaries aren't lost. Typical overlap: 10-20% of chunk size.
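The mechanics are simple enough to sketch in plain Python. This is a simplified stand-in for a library splitter, showing how the overlap makes each chunk start inside the previous one:

```python
def fixed_size_chunks(text, chunk_size=500, overlap=50):
    """Split text into fixed-size chunks with overlap between neighbors."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # each chunk starts `step` chars after the last
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

The last `overlap` characters of each chunk reappear at the start of the next, so a sentence cut by one boundary survives intact in the neighboring chunk.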
When to Use
Good for: Homogeneous text without clear structure (e.g., transcripts, chat logs, plain text). Quick prototyping when you want a baseline.

Bad for: Structured documents where you want to preserve section boundaries. Splits mid-sentence and mid-paragraph without regard for meaning.
# LangChain — Fixed-size by character count
from langchain_text_splitters import CharacterTextSplitter

splitter = CharacterTextSplitter(
    chunk_size=500,    # max characters per chunk
    chunk_overlap=50,  # overlap between chunks
    separator="\n"     # try to split on newlines
)
chunks = splitter.split_documents(docs)
# → list of Document objects, each ≤500 chars
Character count ≠ token count. 500 characters is roughly 100-125 tokens. If your LLM has a 4K context window and you retrieve 5 chunks, use ~500 tokens per chunk max. Always think in tokens, not characters.
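A rough conversion is easy to keep on hand. The 4-characters-per-token ratio below is a heuristic for English text (an assumption; use your model's actual tokenizer for exact counts):

```python
def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate: ~4 characters per token for English text.
    A heuristic only — use the model's real tokenizer for exact counts."""
    return max(1, len(text) // chars_per_token)

# 500 characters ≈ 125 tokens under this heuristic
```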
Recursive Character Splitting
The most popular strategy — LangChain's default splitter
The Idea
Instead of splitting on a single separator, try a cascade of separators from most to least meaningful. First try to split on double newlines (paragraph boundaries). If chunks are still too big, split on single newlines. Then on sentences. Then on spaces. This preserves as much structure as possible.
Separator Cascade
1. "\n\n" — Paragraph boundaries
2. "\n" — Line breaks
3. ". " — Sentence endings
4. " " — Word boundaries
5. "" — Character-level (last resort)
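The cascade above can be sketched as a short recursive function. This is a simplified illustration, not LangChain's implementation — a real splitter also re-attaches the separators it split on and merges small pieces back up toward the chunk size:

```python
def recursive_split(text, max_len=100, seps=("\n\n", "\n", ". ", " ")):
    """Split text with a cascade of separators, most meaningful first."""
    if len(text) <= max_len:
        return [text]  # already small enough
    if not seps:
        # Last resort: hard split at max_len characters
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, rest = seps[0], seps[1:]
    parts = [p for p in text.split(sep) if p]
    chunks = []
    for part in parts:
        # Any part still too long falls through to the next separator
        chunks.extend(recursive_split(part, max_len, rest))
    return chunks
```

If a separator is absent from the text, `split` returns the text unchanged and the recursion simply tries the next separator in the cascade.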
# LangChain — Recursive splitting
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = splitter.split_documents(docs)
# Tries \n\n first, falls back to \n, etc.
This is the recommended default. Start here. It works well for most document types because it respects natural text boundaries. Only switch to semantic or parent-child chunking when you have evidence that recursive splitting is hurting retrieval quality.
Semantic Chunking
Split by meaning, not by character count
How It Works
1. Split the document into sentences.
2. Embed each sentence (or group of sentences).
3. Compare consecutive sentence embeddings using cosine similarity.
4. When similarity drops below a threshold, insert a chunk boundary.

The result: chunks that each cover a coherent topic, regardless of character count.
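The boundary-detection step (steps 3-4 above) reduces to cosine similarity over consecutive embeddings. A minimal sketch — the 2-D vectors here are toy stand-ins for real sentence embeddings, and the threshold value is illustrative:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def semantic_boundaries(embeddings, threshold=0.5):
    """Indices where similarity between consecutive sentences drops
    below the threshold — i.e., where to insert chunk boundaries."""
    return [
        i + 1
        for i in range(len(embeddings) - 1)
        if cosine(embeddings[i], embeddings[i + 1]) < threshold
    ]
```

Libraries like LlamaIndex use a percentile-based threshold rather than a fixed one, so the cutoff adapts to each document's similarity distribution.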
Trade-offs
Pros: Chunks are semantically coherent. No mid-topic splits. Variable chunk sizes that match the natural structure of the text.

Cons: Requires an embedding model at chunking time (adds cost and latency). Chunk sizes vary widely — some may be very small or very large. Harder to debug than fixed-size.
# LlamaIndex — Semantic chunking
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding

splitter = SemanticSplitterNodeParser(
    embed_model=OpenAIEmbedding(),
    buffer_size=1,  # sentences per group
    breakpoint_percentile_threshold=95
)
nodes = splitter.get_nodes_from_documents(docs)
# Chunks split where topic changes
Use semantic chunking when topic boundaries matter. Research papers, legal documents, and technical manuals benefit most — they have clear topic shifts that character-based splitting misses.
Parent-Child / Hierarchical Chunking
Small chunks for retrieval, large chunks for context
The Best of Both Worlds
The chunk-size dilemma: small chunks match queries better, but large chunks give the LLM more context. Parent-child chunking solves this by creating two levels:

Child chunks (small, ~200 tokens) — used for embedding and retrieval. They match specific queries precisely.

Parent chunks (large, ~2000 tokens) — stored alongside. When a child matches, the parent is sent to the LLM, giving it full surrounding context.
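The two-level structure is just a mapping from each child chunk back to its parent. A toy sketch — keyword matching stands in for embedding search, and the sizes are in characters rather than tokens for simplicity:

```python
def build_parent_child(doc, parent_size=400, child_size=100):
    """Split doc into parents, each parent into children; record which
    parent each child belongs to."""
    parents = [doc[i:i + parent_size] for i in range(0, len(doc), parent_size)]
    index = []  # (child_text, parent_id) — in practice you embed child_text
    for pid, parent in enumerate(parents):
        for j in range(0, len(parent), child_size):
            index.append((parent[j:j + child_size], pid))
    return parents, index

def retrieve_parent(query, parents, index):
    """Match against the small children, but return the large parent."""
    for child, pid in index:
        if query in child:  # stand-in for nearest-neighbor search
            return parents[pid]
    return None
```

The key move is in `retrieve_parent`: the match happens on the small child, but what reaches the LLM is the full parent.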
# LangChain — Parent-child retrieval
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

retriever = ParentDocumentRetriever(
    vectorstore=chroma_store,  # an existing vector store, e.g. Chroma
    docstore=InMemoryStore(),
    child_splitter=RecursiveCharacterTextSplitter(
        chunk_size=200   # small for retrieval
    ),
    parent_splitter=RecursiveCharacterTextSplitter(
        chunk_size=2000  # large for context
    ),
)
retriever.add_documents(docs)
# Search matches child → returns parent
Parent-child is a go-to pattern for production RAG and often outperforms single-level chunking. The child handles precision (finding the right passage); the parent supplies the surrounding context the LLM needs to answer well.
Chunk Size Trade-offs
The 256 vs 512 vs 1024 debate
Size Guidelines
128-256 tokens: Best for precise Q&A. Each chunk answers one specific question. High retrieval precision, but may miss broader context.

256-512 tokens: The sweet spot for most use cases. Enough context for the LLM to understand, small enough for focused retrieval.

512-1024 tokens: Good for summarization tasks or when documents have long, interconnected paragraphs. Lower retrieval precision but richer context.
Overlap Guidelines
10-20% of chunk size is the standard overlap. For 500-token chunks, use 50-100 tokens of overlap. Overlap ensures sentences at chunk boundaries appear in both chunks, preventing information loss.

No overlap is fine for parent-child setups (the parent provides the context) or when chunks are split at natural boundaries (section headings).
Context Window Budget
If you retrieve k=5 chunks and each is 500 tokens, that's 2,500 tokens of context. Add the system prompt (~200 tokens), user query (~50 tokens), and leave room for the answer (~500 tokens). Total: ~3,250 tokens. Make sure this fits your model's context window.
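The budget above is worth checking programmatically before deployment. A minimal helper, with the same illustrative overhead figures as the example:

```python
def context_budget(k, chunk_tokens, system=200, query=50, answer=500):
    """Total prompt tokens: k retrieved chunks plus fixed overheads
    (system prompt, user query, and room reserved for the answer)."""
    return k * chunk_tokens + system + query + answer

# 5 chunks of 500 tokens each, plus overheads
print(context_budget(5, 500))  # → 3250
```

Run this against your model's context window whenever you change `k` or the chunk size — blown budgets fail silently as truncated context.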
There is no universal best chunk size. It depends on your documents, your queries, and your embedding model. The only way to find the optimum is to test multiple sizes and measure retrieval quality on your actual data.
Chunking Best Practices
Practical advice from production RAG systems
Always Do
Inspect your chunks. Print 20 random chunks and read them. Can a human understand each one without extra context? If not, your LLM won't either.
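A reproducible sampling helper makes this inspection a one-liner (the placeholder chunk list is illustrative; pass your real chunks):

```python
import random

def sample_chunks(chunks, n=20, seed=0):
    """Return a reproducible random sample of chunks for manual review."""
    rng = random.Random(seed)
    return rng.sample(chunks, min(n, len(chunks)))

# Placeholder chunks — substitute the output of your splitter
chunks = [f"chunk {j}: some example text" for j in range(100)]
for i, chunk in enumerate(sample_chunks(chunks)):
    print(f"--- sample {i} ({len(chunk)} chars) ---\n{chunk}")
```

Fixing the seed means you can re-read the same sample after changing your splitter and compare directly.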

Preserve metadata. Every chunk should carry its source document, page number, section title, and any other metadata from the loading stage. This enables filtered retrieval and citation.

Test on real queries. Take 10 questions your users actually ask. Retrieve chunks for each. Are the right chunks being returned? This is more valuable than any theoretical optimization.
Common Mistakes
Ignoring document structure. If your documents have headings, use them as chunk boundaries. Don't split a "Refund Policy" section across 3 chunks that also contain parts of "Shipping Policy."

One strategy for all documents. PDFs, Markdown, and HTML have different structures. Use format-specific splitters when possible (e.g., MarkdownHeaderTextSplitter for Markdown).

Never re-evaluating. As your document corpus grows or user queries change, your optimal chunk size may shift. Re-test periodically.
Start simple, measure, iterate. Begin with RecursiveCharacterTextSplitter at 500 tokens with 100 overlap. Measure retrieval quality. Only add complexity (semantic, parent-child) when you have evidence the simple approach is failing.