Ch 3 — Chunking Strategies — Under the Hood

Splitter internals, token counting, semantic breakpoints, and overlap mechanics
A. CharacterTextSplitter Internals: The simplest splitter under the hood
Full Text: raw document (page_content string)
  -> split on separator
Segments: split by \n or custom separator
  -> merge to size
Merge: combine segments until chunk_size
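
The split-then-merge flow above can be sketched in a few lines of plain Python. This is a minimal illustration, not LangChain's actual implementation: the function name `split_and_merge` and its defaults are assumptions made for this example, and a single segment longer than `chunk_size` is emitted as-is rather than split further.

```python
def split_and_merge(text: str, separator: str = "\n", chunk_size: int = 40) -> list[str]:
    """Split on `separator`, then greedily merge segments up to `chunk_size` characters.

    Hypothetical helper for illustration; not a library API.
    """
    segments = [s for s in text.split(separator) if s]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for seg in segments:
        # Cost of appending this segment, including a separator if it is not the first.
        added = len(seg) + (len(separator) if current else 0)
        if current and current_len + added > chunk_size:
            # The running chunk is full: emit it and start a new one.
            chunks.append(separator.join(current))
            current, current_len = [], 0
            added = len(seg)
        current.append(seg)
        current_len += added
    if current:
        chunks.append(separator.join(current))
    return chunks


chunks = split_and_merge("aaaa\nbbbb\ncccc\ndddd", separator="\n", chunk_size=9)
```

With a `chunk_size` of 9, the four 4-character segments merge pairwise, since two segments plus one separator are exactly 9 characters.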

Recursive splitting: the separator cascade
B. RecursiveCharacterTextSplitter: The _split_text recursion
Try "\n\n": paragraph boundaries
  -> too big?
Try "\n": line breaks
  -> too big?
Try ". ": sentence endings
  -> last resort
Try " ": word boundaries
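
The cascade above can be sketched as a short recursive function. This is a simplified illustration under stated assumptions, not the library's `_split_text`: the name `recursive_split` is hypothetical, the separator list is the one shown in the diagram, separators are dropped from the output, and the step that merges small pieces back up to `chunk_size` is left out so the recursion itself stays visible.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split `text` with the first separator; recurse with the next one on oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # Assumption for this sketch: with no separators left, hard-cut by characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, remaining = separators[0], separators[1:]
    pieces: list[str] = []
    for part in text.split(sep):
        if not part:
            continue
        if len(part) <= chunk_size:
            pieces.append(part)
        else:
            # Still too big: fall through to the next, finer separator.
            pieces.extend(recursive_split(part, remaining, chunk_size))
    return pieces


text = "Short para.\n\nThis is a much longer paragraph. It has two sentences."
pieces = recursive_split(text, ["\n\n", "\n", ". ", " "], chunk_size=35)
```

Here the first paragraph fits as-is, while the second exceeds 35 characters and falls through the "\n" level to the sentence separator.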

Token-aware splitting with tiktoken
C. Token-Aware Splitting: Why character count is not token count
Text: "Hello world" = 11 chars
  -> tokenize
Tokens: ["Hello", " world"] = 2 tokens
  -> count
Token Length: exact count for chunk_size limit
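
The gap between character count and token count can be shown with a stand-in tokenizer. The sketch below deliberately avoids a real tokenizer dependency: `toy_tokenize` is a hypothetical regex-based substitute for tiktoken, not a BPE tokenizer, and `split_by_tokens` is an illustrative helper. In practice the token list would come from a real tokenizer, and its boundaries will differ from this toy's.

```python
import re

# Toy stand-in for a real tokenizer: a word or punctuation run with an optional
# leading space, or a single whitespace character. Joining the tokens restores the text.
_TOKEN_RE = re.compile(r" ?\w+| ?[^\w\s]+|\s")


def toy_tokenize(text: str) -> list[str]:
    return _TOKEN_RE.findall(text)


def split_by_tokens(text: str, chunk_size: int) -> list[str]:
    """Cut `text` into chunks of at most `chunk_size` tokens (not characters)."""
    tokens = toy_tokenize(text)
    return ["".join(tokens[i:i + chunk_size]) for i in range(0, len(tokens), chunk_size)]


text = "Hello world"
char_count = len(text)
tokens = toy_tokenize(text)
chunks = split_by_tokens("one two three four five", chunk_size=2)
```

A character-based `chunk_size` of 11 and a token-based `chunk_size` of 2 describe the same string here, which is why the length function must match the unit the downstream model counts in.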

Semantic chunking: embedding-based breakpoints
D. Semantic Chunking Internals: Cosine similarity breakpoints between sentences
Sentences: split text into individual sentences
  -> embed
Vectors: embed each sentence
  -> compare
Similarity: cosine between neighbors
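
The sentence, vector, and similarity steps above can be sketched end to end. Everything named here is an assumption for illustration: `toy_embed` is a bag-of-words stand-in for a real embedding model, the regex sentence splitter is naive, and the fixed `threshold` is one possible breakpoint rule (real semantic chunkers may use a different rule, such as a percentile of the observed distances).

```python
import math
import re
from collections import Counter


def split_sentences(text: str) -> list[str]:
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def toy_embed(sentence: str) -> Counter:
    # Stand-in for an embedding model: word counts as a sparse vector.
    return Counter(re.findall(r"\w+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(text: str, threshold: float = 0.2) -> list[str]:
    """Start a new chunk wherever neighboring sentences fall below `threshold` similarity."""
    sentences = split_sentences(text)
    if not sentences:
        return []
    vectors = [toy_embed(s) for s in sentences]
    groups = [[sentences[0]]]
    for prev_vec, cur_vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev_vec, cur_vec) < threshold:
            groups.append([sentence])    # breakpoint: topic shift
        else:
            groups[-1].append(sentence)  # same topic: extend current chunk
    return [" ".join(g) for g in groups]


text = ("Cats chase mice. Cats also chase birds. "
        "Stocks fell sharply today. Stocks fell again this morning.")
chunks = semantic_chunks(text, threshold=0.2)
```

The only breakpoint falls between the second and third sentences, where the two word-count vectors share no terms and the cosine similarity is zero.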

Parent-child architecture and retrieval
E. Parent-Child Architecture: Two-level chunking with linked retrieval
Document: original full document
  -> parent split
Parents: large chunks (~2000 tokens)
  -> child split
Children: small chunks (~200 tokens)
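
The two-level structure above comes down to children that carry a pointer to their parent. The sketch below is a minimal illustration under stated assumptions: whitespace-separated words stand in for tokens, word overlap stands in for vector search, and `Child`, `build_index`, and `retrieve_parent` are hypothetical names rather than any library's API.

```python
from dataclasses import dataclass


@dataclass
class Child:
    text: str
    parent_id: int  # link back to the parent chunk this child was cut from


def window(words: list[str], size: int) -> list[str]:
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def build_index(document: str, parent_size: int = 2000, child_size: int = 200):
    """Split into large parents, then split each parent into small linked children."""
    parents = window(document.split(), parent_size)
    children = [
        Child(text=piece, parent_id=pid)
        for pid, parent in enumerate(parents)
        for piece in window(parent.split(), child_size)
    ]
    return parents, children


def retrieve_parent(query: str, parents: list[str], children: list[Child]) -> str:
    """Match against small children, but return the larger parent for context."""
    query_words = set(query.lower().split())
    best = max(children, key=lambda c: len(query_words & set(c.text.lower().split())))
    return parents[best.parent_id]


document = " ".join(f"w{i}" for i in range(20))
parents, children = build_index(document, parent_size=10, child_size=5)
context = retrieve_parent("w13", parents, children)
```

Searching over the small children keeps matching precise, while returning the parent gives the downstream model the surrounding context.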

Overlap mechanics and metadata propagation
F. Overlap & Metadata Propagation: What happens at chunk boundaries
Overlap: the last k characters of chunk i are copied to the start of chunk i+1
  -> propagate
Metadata: source, page, and index copied to each chunk
  -> output
Final Chunks: ready for embedding
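
Both boundary behaviors above fit in one short function. This is a minimal sketch assuming fixed-width character windows: the name `chunk_with_overlap`, the dict layout, and the sample metadata values are hypothetical and chosen for illustration only.

```python
def chunk_with_overlap(text: str, chunk_size: int, overlap: int, metadata: dict) -> list[dict]:
    """Fixed-size character chunks; each chunk repeats the last `overlap` chars of the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks: list[dict] = []
    for index, start in enumerate(range(0, len(text), step)):
        chunks.append({
            "text": text[start:start + chunk_size],
            # Copy the document-level metadata and add the chunk's own index.
            "metadata": {**metadata, "index": index},
        })
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


chunks = chunk_with_overlap(
    "abcdefghij", chunk_size=4, overlap=2,
    metadata={"source": "doc.txt", "page": 1},
)
texts = [c["text"] for c in chunks]
```

Each chunk gets its own copy of the metadata dict, so editing one chunk's metadata later does not affect the others.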