Ch 3: Chunking Strategies — Under the Hood

Splitter internals, token counting, semantic breakpoints, and overlap mechanics

A. CharacterTextSplitter Internals
The simplest splitter under the hood

Full Text: the raw document, held in the page_content string
  -> split on separator
Segments: the text split by \n or a custom separator
  -> merge to size
Merge: segments combined until chunk_size is reached
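
The split-then-merge flow above can be sketched in a few lines of plain Python. This is a minimal illustration, not LangChain's code: the function name `split_and_merge` and its defaults are assumptions made for this example, and it measures size in characters only.

```python
def split_and_merge(text: str, separator: str = "\n", chunk_size: int = 40) -> list[str]:
    """Split on a single separator, then greedily merge segments up to chunk_size characters."""
    # Step 1: split on the separator and drop empty segments.
    segments = [s for s in text.split(separator) if s]

    chunks: list[str] = []
    current: list[str] = []   # segments collected for the chunk being built
    current_len = 0           # character length of the joined current chunk

    for seg in segments:
        # Cost of adding this segment, including a separator if it is not the first one.
        extra = len(seg) + (len(separator) if current else 0)
        if current and current_len + extra > chunk_size:
            # Step 2: the chunk is full, so emit it and start a new one.
            chunks.append(separator.join(current))
            current, current_len = [], 0
            extra = len(seg)
        current.append(seg)
        current_len += extra

    if current:
        chunks.append(separator.join(current))
    return chunks


if __name__ == "__main__":
    print(split_and_merge("aaaa\nbbbb\ncccc\ndddd", separator="\n", chunk_size=9))
```

Note that a single segment longer than chunk_size is emitted as-is in this sketch. That limitation is what motivates trying finer separators.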

Recursive splitting: the separator cascade

B. RecursiveCharacterTextSplitter
The _split_text recursion

Try "\n\n": paragraph boundaries
  -> too big?
Try "\n": line breaks
  -> too big?
Try ". ": sentence endings
  -> last resort
Try " ": word boundaries
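
The cascade above can be sketched as a small recursive function. This is a simplified illustration under stated assumptions, not the library's `_split_text`: the names `recursive_split` and `_merge` are made up for this example, sizes are in characters, separators consumed at a split point are dropped, and a hard character cut is used if every separator fails.

```python
def _merge(parts: list[str], sep: str, chunk_size: int) -> list[str]:
    """Greedily join adjacent small parts with sep while the result fits in chunk_size."""
    chunks: list[str] = []
    current = ""
    for part in parts:
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = part
    if current:
        chunks.append(current)
    return chunks


def recursive_split(
    text: str,
    chunk_size: int,
    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " "),
) -> list[str]:
    """Try the coarsest separator first; recurse with finer ones on pieces that are too big."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # Nothing left to try: fall back to a hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    small: list[str] = []   # consecutive pieces that already fit
    for part in text.split(sep):
        if not part:
            continue
        if len(part) <= chunk_size:
            small.append(part)
        else:
            # Flush what fits so far, then split the oversized piece more finely.
            chunks.extend(_merge(small, sep, chunk_size))
            small = []
            chunks.extend(recursive_split(part, chunk_size, finer))
    chunks.extend(_merge(small, sep, chunk_size))
    return chunks


if __name__ == "__main__":
    sample = "Short para.\n\nThis one is a much longer paragraph. It has two sentences."
    for chunk in recursive_split(sample, chunk_size=30):
        print(repr(chunk))
```

The short paragraph survives intact, while the long one falls through to sentence and then word boundaries, so coarse structure is preserved wherever it fits.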

Token-aware splitting with tiktoken

C. Token-Aware Splitting
Why character count is not token count

Text: "Hello world" = 11 chars
  -> tokenize
Tokens: ["Hello", " world"] = 2 tokens
  -> count
Token Length: the exact token count, checked against the chunk_size limit
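
To make the character-versus-token gap concrete, the sketch below measures chunk size in tokens instead of characters. To stay dependency-free it uses a toy regex tokenizer, `toy_tokenize`, which is an assumption made for this example and not a real BPE tokenizer. With tiktoken installed, the length of the list returned by an encoding's `encode` method would play the same role.

```python
import re
from typing import Callable


def toy_tokenize(text: str) -> list[str]:
    """Stand-in tokenizer (hypothetical): leading whitespace plus a run of non-space characters.

    Real tokenizers such as tiktoken's use byte-pair encoding and will split differently.
    Trailing whitespace is dropped by this toy version.
    """
    return re.findall(r"\s*\S+", text)


def split_by_tokens(
    text: str,
    chunk_size: int,
    tokenize: Callable[[str], list[str]] = toy_tokenize,
) -> list[str]:
    """Cut the token sequence every chunk_size tokens and join each window back into text."""
    tokens = tokenize(text)
    return ["".join(tokens[i:i + chunk_size]) for i in range(0, len(tokens), chunk_size)]


if __name__ == "__main__":
    text = "Hello world"
    print(len(text), "characters")            # character count
    print(len(toy_tokenize(text)), "tokens")  # token count under the toy tokenizer
    print(split_by_tokens("one two three four five", chunk_size=2))
```

Passing the tokenizer as a parameter keeps the splitting logic independent of any one model's vocabulary, which matters because the limit that counts is the target model's token limit.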

Semantic chunking: embedding-based breakpoints

D. Semantic Chunking Internals
Cosine similarity breakpoints between sentences

Sentences: split the text into individual sentences
  -> embed
Vectors: embed each sentence
  -> compare
Similarity: cosine similarity between neighboring sentences
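
The three stages above can be sketched end to end. Everything here is illustrative: a bag-of-words `Counter` stands in for a real embedding model, the sentence splitter is a simple regex, and the fixed `threshold` for placing a breakpoint is an assumption (production implementations often derive it from the distribution of distances instead).

```python
import math
import re
from collections import Counter


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def embed(sentence: str) -> Counter:
    """Toy 'embedding' (hypothetical stand-in): word counts instead of a dense vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(text: str, threshold: float = 0.2) -> list[str]:
    """Start a new chunk wherever similarity between neighboring sentences drops below threshold."""
    sentences = split_sentences(text)
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    groups = [[sentences[0]]]
    for prev_vec, cur_vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev_vec, cur_vec) < threshold:
            groups.append([sentence])      # breakpoint: topic shift
        else:
            groups[-1].append(sentence)    # same topic: extend the current chunk
    return [" ".join(group) for group in groups]


if __name__ == "__main__":
    print(semantic_chunks("Cats chase mice. Cats chase birds. Stocks fell today."))
```

The first two sentences share most of their words and stay together, while the third has no overlap with its neighbor and starts a new chunk.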

Parent-child architecture and retrieval

E. Parent-Child Architecture
Two-level chunking with linked retrieval

Document: the original full document
  -> parent split
Parents: large chunks, ~2000 tokens
  -> child split
Children: small chunks, ~200 tokens
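
A rough sketch of the two-level structure follows: children are matched against the query, and the linked parent is what gets returned. The names `Child`, `build_index`, and `retrieve_parent` are hypothetical, whitespace-separated words stand in for tokens, and word overlap stands in for embedding similarity.

```python
from dataclasses import dataclass


@dataclass
class Child:
    text: str
    parent_id: int   # link back to the parent chunk this child was cut from


def window(words: list[str], size: int) -> list[str]:
    """Cut a word list into consecutive windows of at most `size` words."""
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def build_index(document: str, parent_size: int = 2000, child_size: int = 200):
    """Split the document into parents, then split each parent into linked children."""
    parents = window(document.split(), parent_size)
    children = [
        Child(text, parent_id)
        for parent_id, parent in enumerate(parents)
        for text in window(parent.split(), child_size)
    ]
    return parents, children


def retrieve_parent(query: str, parents: list[str], children: list[Child]) -> str:
    """Score children by shared words (a toy relevance measure) and return the best child's parent."""
    query_words = set(query.lower().split())
    best = max(children, key=lambda c: len(query_words & set(c.text.lower().split())))
    return parents[best.parent_id]


if __name__ == "__main__":
    doc = " ".join(f"w{i}" for i in range(10))
    parents, children = build_index(doc, parent_size=4, child_size=2)
    print(retrieve_parent("w6", parents, children))
```

Small children give precise matches, while returning the parent gives the downstream model more surrounding context than the matched child alone.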

Overlap mechanics and metadata propagation

F. Overlap & Metadata Propagation
What happens at chunk boundaries

Overlap: the last k characters of chunk n are copied to the start of chunk n+1
  -> propagate
Metadata: source, page, and index are copied to each chunk
  -> output
Final Chunks: ready for embedding
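
Both boundary behaviors can be shown with a fixed-size sliding window. This is a simplified sketch: the function name `chunk_with_overlap` and the dictionary layout are assumptions made for this example, and the overlap here is exact in characters, whereas separator-based splitters overlap by whole segments and so only approximate it.

```python
def chunk_with_overlap(text: str, chunk_size: int, overlap: int, metadata: dict) -> list[dict]:
    """Slide a window of chunk_size characters, stepping by chunk_size - overlap.

    Each chunk gets its own copy of the document metadata plus its position index.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap
    chunks: list[dict] = []
    for index, start in enumerate(range(0, len(text), step)):
        chunks.append({
            "text": text[start:start + chunk_size],
            # Copy, do not share: later edits to one chunk's metadata stay local.
            "metadata": {**metadata, "index": index},
        })
        if start + chunk_size >= len(text):
            break   # the window already reached the end of the text
    return chunks


if __name__ == "__main__":
    doc_meta = {"source": "doc.txt", "page": 1}   # hypothetical metadata values
    for chunk in chunk_with_overlap("abcdefghij", chunk_size=4, overlap=2, metadata=doc_meta):
        print(chunk)
```

The resulting list of text-plus-metadata records is the shape that would be handed to the embedding step.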