Ch 3 — Chunking Strategies

Breaking documents into retrieval-friendly pieces
Why Chunking Matters
Chunks are the unit of retrieval — get them wrong and everything downstream fails
The Core Problem
Documents are too long to embed as a single vector. A 50-page PDF can't fit in a single embedding. Even if it could, a single vector for 50 pages would be too vague to match specific questions. Chunking splits documents into smaller pieces that each capture a focused topic.
Why Size Matters
Too large: Chunks contain multiple topics. When retrieved, the LLM gets noise alongside the answer — diluting relevance and wasting context window tokens.

Too small: Chunks lose context. A sentence like "The company exceeded targets" means nothing without knowing which company and which targets.
Too Large (5000 tokens)
Contains the answer but buried in 4 unrelated paragraphs. LLM may miss it or get confused by contradicting info in the same chunk.
Right Size (300-500 tokens)
Focused on one topic. High relevance score when matched. LLM gets clean context with minimal noise.
Too Small (50 tokens)
Lost all context. "Revenue was $4.2M" — for which quarter? Which division? The LLM can't answer accurately.
Chunking is the highest-leverage optimization in RAG. Before tuning embeddings, rerankers, or prompts — get your chunks right. Bad chunks make everything else irrelevant.
Fixed-Size Chunking
The simplest approach: split by character or token count
How It Works
Split the document into chunks of a fixed size (e.g., 500 characters or 256 tokens). Add overlap between consecutive chunks so that sentences at chunk boundaries aren't lost. Typical overlap: 10-20% of chunk size.
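The mechanics are simple enough to sketch in plain Python. This is a simplified stand-in for a library splitter, showing how the overlap makes each chunk start inside the previous one:

```python
def fixed_size_chunks(text, chunk_size=500, overlap=50):
    """Split text into fixed-size chunks with overlap between neighbors."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # each chunk starts `step` chars after the last
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

The last `overlap` characters of each chunk reappear at the start of the next, so a sentence cut by one boundary survives intact in the neighboring chunk.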
When to Use
Good for: Homogeneous text without clear structure (e.g., transcripts, chat logs, plain text). Quick prototyping when you want a baseline.

Bad for: Structured documents where you want to preserve section boundaries. Splits mid-sentence and mid-paragraph without regard for meaning.
# LangChain — Fixed-size by character count
from langchain_text_splitters import CharacterTextSplitter

splitter = CharacterTextSplitter(
    chunk_size=500,    # max characters per chunk
    chunk_overlap=50,  # overlap between chunks
    separator="\n"     # try to split on newlines
)
chunks = splitter.split_documents(docs)
# → list of Document objects, each ≤500 chars
Character count ≠ token count. 500 characters is roughly 100-125 tokens. If your LLM has a 4K context window and you retrieve 5 chunks, use ~500 tokens per chunk max. Always think in tokens, not characters.
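A rough conversion is easy to keep on hand. The 4-characters-per-token ratio below is a heuristic for English text (an assumption; use your model's actual tokenizer for exact counts):

```python
def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate: ~4 characters per token for English text.
    A heuristic only — use the model's real tokenizer for exact counts."""
    return max(1, len(text) // chars_per_token)

# 500 characters ≈ 125 tokens under this heuristic
```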
Recursive Character Splitting
The most popular strategy — LangChain's default splitter
The Idea
Instead of splitting on a single separator, try a cascade of separators from most to least meaningful. First try to split on double newlines (paragraph boundaries). If chunks are still too big, split on single newlines. Then on sentences. Then on spaces. This preserves as much structure as possible.
Separator Cascade
1. "\n\n" — Paragraph boundaries
2. "\n" — Line breaks
3. ". " — Sentence endings
4. " " — Word boundaries
5. "" — Character-level (last resort)
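The cascade above can be sketched as a short recursive function. This is a simplified illustration, not LangChain's implementation — a real splitter also re-attaches the separators it split on and merges small pieces back up toward the chunk size:

```python
def recursive_split(text, max_len=100, seps=("\n\n", "\n", ". ", " ")):
    """Split text with a cascade of separators, most meaningful first."""
    if len(text) <= max_len:
        return [text]  # already small enough
    if not seps:
        # Last resort: hard split at max_len characters
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, rest = seps[0], seps[1:]
    parts = [p for p in text.split(sep) if p]
    chunks = []
    for part in parts:
        # Any part still too long falls through to the next separator
        chunks.extend(recursive_split(part, max_len, rest))
    return chunks
```

If a separator is absent from the text, `split` returns the text unchanged and the recursion simply tries the next separator in the cascade.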
# LangChain — Recursive splitting
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = splitter.split_documents(docs)
# Tries \n\n first, falls back to \n, etc.
This is the recommended default. Start here. It works well for most document types because it respects natural text boundaries. Only switch to semantic or parent-child chunking when you have evidence that recursive splitting is hurting retrieval quality.
Semantic Chunking
Split by meaning, not by character count
How It Works
1. Split the document into sentences.
2. Embed each sentence (or group of sentences).
3. Compare consecutive sentence embeddings using cosine similarity.
4. When similarity drops below a threshold, insert a chunk boundary.

The result: chunks that each cover a coherent topic, regardless of character count.
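The boundary-detection step (steps 3-4 above) reduces to cosine similarity over consecutive embeddings. A minimal sketch — the 2-D vectors here are toy stand-ins for real sentence embeddings, and the threshold value is illustrative:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def semantic_boundaries(embeddings, threshold=0.5):
    """Indices where similarity between consecutive sentences drops
    below the threshold — i.e., where to insert chunk boundaries."""
    return [
        i + 1
        for i in range(len(embeddings) - 1)
        if cosine(embeddings[i], embeddings[i + 1]) < threshold
    ]
```

Libraries like LlamaIndex use a percentile-based threshold rather than a fixed one, so the cutoff adapts to each document's similarity distribution.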
Trade-offs
Pros: Chunks are semantically coherent. No mid-topic splits. Variable chunk sizes that match the natural structure of the text.

Cons: Requires an embedding model at chunking time (adds cost and latency). Chunk sizes vary widely — some may be very small or very large. Harder to debug than fixed-size.
# LlamaIndex — Semantic chunking
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding

splitter = SemanticSplitterNodeParser(
    embed_model=OpenAIEmbedding(),
    buffer_size=1,  # sentences per group
    breakpoint_percentile_threshold=95
)
nodes = splitter.get_nodes_from_documents(docs)
# Chunks split where topic changes
Use semantic chunking when topic boundaries matter. Research papers, legal documents, and technical manuals benefit most — they have clear topic shifts that character-based splitting misses.
Parent-Child / Hierarchical Chunking
Small chunks for retrieval, large chunks for context
The Best of Both Worlds
The chunk-size dilemma: small chunks match queries better, but large chunks give the LLM more context. Parent-child chunking solves this by creating two levels:

Child chunks (small, ~200 tokens) — used for embedding and retrieval. They match specific queries precisely.

Parent chunks (large, ~2000 tokens) — stored alongside. When a child matches, the parent is sent to the LLM, giving it full surrounding context.
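The two-level structure is just a mapping from each child chunk back to its parent. A toy sketch — keyword matching stands in for embedding search, and the sizes are in characters rather than tokens for simplicity:

```python
def build_parent_child(doc, parent_size=400, child_size=100):
    """Split doc into parents, each parent into children; record which
    parent each child belongs to."""
    parents = [doc[i:i + parent_size] for i in range(0, len(doc), parent_size)]
    index = []  # (child_text, parent_id) — in practice you embed child_text
    for pid, parent in enumerate(parents):
        for j in range(0, len(parent), child_size):
            index.append((parent[j:j + child_size], pid))
    return parents, index

def retrieve_parent(query, parents, index):
    """Match against the small children, but return the large parent."""
    for child, pid in index:
        if query in child:  # stand-in for nearest-neighbor search
            return parents[pid]
    return None
```

The key move is in `retrieve_parent`: the match happens on the small child, but what reaches the LLM is the full parent.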
# LangChain — Parent-child retrieval
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

retriever = ParentDocumentRetriever(
    vectorstore=chroma_store,  # an existing vector store, e.g. Chroma
    docstore=InMemoryStore(),
    child_splitter=RecursiveCharacterTextSplitter(
        chunk_size=200   # small for retrieval
    ),
    parent_splitter=RecursiveCharacterTextSplitter(
        chunk_size=2000  # large for context
    ),
)
retriever.add_documents(docs)
# Search matches child → returns parent
Parent-child is a go-to pattern for production RAG and often outperforms single-level chunking. The child handles precision (finding the right passage); the parent supplies the surrounding context the LLM needs to answer well.
Chunk Size Trade-offs
The 256 vs 512 vs 1024 debate
Size Guidelines
128-256 tokens: Best for precise Q&A. Each chunk answers one specific question. High retrieval precision, but may miss broader context.

256-512 tokens: The sweet spot for most use cases. Enough context for the LLM to understand, small enough for focused retrieval.

512-1024 tokens: Good for summarization tasks or when documents have long, interconnected paragraphs. Lower retrieval precision but richer context.
Overlap Guidelines
10-20% of chunk size is the standard overlap. For 500-token chunks, use 50-100 tokens of overlap. Overlap ensures sentences at chunk boundaries appear in both chunks, preventing information loss.

No overlap is fine for parent-child setups (the parent provides the context) or when chunks are split at natural boundaries (section headings).
Context Window Budget
If you retrieve k=5 chunks and each is 500 tokens, that's 2,500 tokens of context. Add the system prompt (~200 tokens), user query (~50 tokens), and leave room for the answer (~500 tokens). Total: ~3,250 tokens. Make sure this fits your model's context window.
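The budget above is worth checking programmatically before deployment. A minimal helper, with the same illustrative overhead figures as the example:

```python
def context_budget(k, chunk_tokens, system=200, query=50, answer=500):
    """Total prompt tokens: k retrieved chunks plus fixed overheads
    (system prompt, user query, and room reserved for the answer)."""
    return k * chunk_tokens + system + query + answer

# 5 chunks of 500 tokens each, plus overheads
print(context_budget(5, 500))  # → 3250
```

Run this against your model's context window whenever you change `k` or the chunk size — blown budgets fail silently as truncated context.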
There is no universal best chunk size. It depends on your documents, your queries, and your embedding model. The only way to find the optimum is to test multiple sizes and measure retrieval quality on your actual data.
Chunking Best Practices
Practical advice from production RAG systems
Always Do
Inspect your chunks. Print 20 random chunks and read them. Can a human understand each one without extra context? If not, your LLM won't either.
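A reproducible sampling helper makes this inspection a one-liner (the placeholder chunk list is illustrative; pass your real chunks):

```python
import random

def sample_chunks(chunks, n=20, seed=0):
    """Return a reproducible random sample of chunks for manual review."""
    rng = random.Random(seed)
    return rng.sample(chunks, min(n, len(chunks)))

# Placeholder chunks — substitute the output of your splitter
chunks = [f"chunk {j}: some example text" for j in range(100)]
for i, chunk in enumerate(sample_chunks(chunks)):
    print(f"--- sample {i} ({len(chunk)} chars) ---\n{chunk}")
```

Fixing the seed means you can re-read the same sample after changing your splitter and compare directly.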

Preserve metadata. Every chunk should carry its source document, page number, section title, and any other metadata from the loading stage. This enables filtered retrieval and citation.

Test on real queries. Take 10 questions your users actually ask. Retrieve chunks for each. Are the right chunks being returned? This is more valuable than any theoretical optimization.
Common Mistakes
Ignoring document structure. If your documents have headings, use them as chunk boundaries. Don't split a "Refund Policy" section across 3 chunks that also contain parts of "Shipping Policy."

One strategy for all documents. PDFs, Markdown, and HTML have different structures. Use format-specific splitters when possible (e.g., MarkdownHeaderTextSplitter for Markdown).

Never re-evaluating. As your document corpus grows or user queries change, your optimal chunk size may shift. Re-test periodically.
Start simple, measure, iterate. Begin with RecursiveCharacterTextSplitter at 500 tokens with 100 overlap. Measure retrieval quality. Only add complexity (semantic, parent-child) when you have evidence the simple approach is failing.