The Idea
Instead of splitting on a single separator, use a cascade of separators ordered from most to least meaningful. First try to split on double newlines (paragraph boundaries). If chunks are still too big, split on single newlines, then on sentence endings, then on spaces. This preserves as much document structure as possible at every step.
Separator Cascade
1. \n\n — Paragraph boundaries
2. \n — Line breaks
3. . — Sentence endings
4. " " — Word boundaries
5. "" — Character-level (last resort)
# LangChain — Recursive splitting
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_documents(docs)
# Tries \n\n first, falls back to \n, etc.
This is the recommended default. Start here. It works well for most document types because it respects natural text boundaries. Only switch to semantic or parent-child chunking when you have evidence that recursive splitting is hurting retrieval quality, such as relevant passages being split mid-thought or retrieval scores dropping on boundary-sensitive queries.