Ch 8 — Generation — Under the Hood
Context formatting, token budgets, chain internals, structured output, and faithfulness
A. Context Formatting & Token Budgets
How retrieved chunks become prompt context

1. Format Chunks: add metadata headers
2. Order Context: by relevance or source
3. Token Budget: fit within the context window
Token Math: system_prompt + context + question + max_output ≤ model_context_window
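The budgeting step above can be sketched as a greedy packing loop. This is a minimal illustration, not library code: `count_tokens` is a crude whitespace stand-in (a real system would use the model's tokenizer, e.g. tiktoken for OpenAI models), and `fit_chunks` is a hypothetical helper name.

```python
# Sketch: fit retrieved chunks into the token budget before prompt assembly.
# count_tokens is a crude stand-in (whitespace split); real code would use
# the model's own tokenizer.

def count_tokens(text: str) -> int:
    return len(text.split())

def fit_chunks(chunks, system_prompt, question, context_window, max_output):
    """Keep chunks (already ordered by relevance) until the budget runs out."""
    budget = (context_window - max_output
              - count_tokens(system_prompt) - count_tokens(question))
    kept, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > budget:
            break  # dropping the tail chunk keeps the prompt within the window
        kept.append(chunk)
        used += cost
    return kept

chunks = ["alpha " * 50, "beta " * 50, "gamma " * 50]  # 50 "tokens" each
kept = fit_chunks(chunks, "You are helpful.", "What is alpha?",
                  context_window=200, max_output=64)
print(len(kept))  # → 2: the third chunk would overflow the budget
```

Note the greedy cut-off: because chunks arrive ordered by relevance, truncating from the tail sacrifices the least useful context first.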
B. LangChain RAG Chain Internals
How create_retrieval_chain works under the hood

1. Input: question string
2. Retriever: get documents
3. Stuff Chain: prompt + LLM
4. Result Dict: answer + context
LCEL: Retriever | format_docs | prompt | llm | StrOutputParser — composable pipeline
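The `|` composition above works because LCEL runnables overload the pipe operator. The toy below is not LangChain's actual implementation; it is a minimal sketch of the mechanics, with hypothetical lambda stand-ins for the retriever, formatter, prompt, and LLM stages.

```python
# Toy model of LCEL-style composition: each stage wraps a function, and `|`
# builds a new Runnable that feeds one stage's output into the next.
class Runnable:
    def __init__(self, fn):
        self.fn = fn

    def invoke(self, value):
        return self.fn(value)

    def __or__(self, other):
        # self | other  ==  a Runnable that runs self, then other
        return Runnable(lambda value: other.invoke(self.invoke(value)))

# Hypothetical stand-ins for the real retriever / format_docs / prompt / llm.
retriever = Runnable(lambda q: [f"doc about {q}"])
format_docs = Runnable(lambda docs: "\n".join(docs))
prompt = Runnable(lambda ctx: f"Answer using:\n{ctx}")
llm = Runnable(lambda p: f"ANSWER({p!r})")

chain = retriever | format_docs | prompt | llm
print(chain.invoke("token budgets"))
```

One `invoke` at the head of the chain threads the question through every stage, which is why swapping a retriever or prompt never requires touching the rest of the pipeline.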
C. Structured Output & Citations
Forcing the LLM to return typed, parseable responses

1. Pydantic Schema: define the output type
2. Constrained Gen: JSON mode / tools
3. Typed Object: validated output
Citation Extraction: map each claim to its source chunk via an NLI or LLM grounding check
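In practice the schema step is usually a Pydantic model passed to the LLM via JSON mode or tool calling; the sketch below shows only the parse-and-validate half using stdlib dataclasses, with `CitedAnswer` and `parse_structured` as hypothetical names.

```python
import json
from dataclasses import dataclass, fields

# Hypothetical target schema: an answer plus the ids of supporting chunks.
@dataclass
class CitedAnswer:
    answer: str
    citations: list  # list of source-chunk ids

def parse_structured(raw: str) -> CitedAnswer:
    """Validate the LLM's JSON against the schema (Pydantic does this for real)."""
    data = json.loads(raw)
    expected = {f.name for f in fields(CitedAnswer)}
    if set(data) != expected:
        raise ValueError(f"keys {set(data)} do not match schema {expected}")
    if not isinstance(data["citations"], list):
        raise ValueError("citations must be a list")
    return CitedAnswer(**data)

raw = '{"answer": "RAG grounds generation in retrieved text.", "citations": [0, 2]}'
result = parse_structured(raw)
print(result.citations)  # → [0, 2]
```

Failing loudly on a malformed response is the point: a raised error can trigger a retry with the validation message appended to the prompt, rather than letting unparseable text leak downstream.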
D. Streaming & Async Patterns
Token-by-token delivery to the frontend

1. Backend: forwards token chunks to the frontend over SSE
Latency: TTFT ~200 ms (API), ~50 ms (local). Throughput: 50-100 tokens/sec (API), 30-80 tokens/sec (local)
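The forwarding loop can be sketched with an async generator. `fake_llm_stream` is a stand-in for the model's streaming iterator; the SSE framing (`data: ...` plus a blank line) is the real wire format an EventSource frontend expects.

```python
import asyncio

# Sketch of the backend loop: consume tokens from a (fake) model stream and
# frame each one as a Server-Sent Events message for the frontend.
async def fake_llm_stream(answer: str):
    for token in answer.split():
        await asyncio.sleep(0)  # real code awaits the model's next chunk here
        yield token

async def sse_events(answer: str):
    async for token in fake_llm_stream(answer):
        yield f"data: {token}\n\n"  # SSE frame: "data: ..." + blank line
    yield "data: [DONE]\n\n"        # sentinel so the client can close cleanly

async def main():
    return [event async for event in sse_events("Streaming keeps TTFT low")]

events = asyncio.run(main())
print(events[0])  # → "data: Streaming\n\n"
```

Because the first token is flushed as soon as the model emits it, perceived latency tracks TTFT rather than total generation time.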
E. Faithfulness Checking & Guardrails
Verifying the answer is grounded in the context

1. NLI Check: entailment scoring
2. Guardrails: NeMo Guardrails / custom
3. Safe Output: grounded answer
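The gate can be sketched as a per-sentence support check. A production version would score each sentence with an NLI model (context as premise, sentence as hypothesis); here token overlap stands in for the entailment score, and both function names are hypothetical.

```python
# Sketch of a grounding gate. Token overlap is a crude stand-in for an NLI
# entailment score; the thresholded all-sentences check is the real pattern.
def support_score(sentence: str, context: str) -> float:
    s = set(sentence.lower().split())
    c = set(context.lower().split())
    return len(s & c) / len(s) if s else 0.0

def is_grounded(answer: str, context: str, threshold: float = 0.6) -> bool:
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    return all(support_score(s, context) >= threshold for s in sentences)

context = "the retriever returns chunks and the llm answers from them"
print(is_grounded("the llm answers from chunks", context))   # → True
print(is_grounded("the llm was trained in 2019", context))   # → False
```

Answers that fail the gate are typically regenerated, rewritten to drop the unsupported sentence, or replaced with an "I don't know" fallback.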
F. Map-Reduce & Refine Chains
Handling context that exceeds the model window

1. Map: per-chunk answers
2. Reduce: synthesize the final answer
or
3. Refine: iteratively update a single running answer
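Both strategies above can be sketched in a few lines. `llm` is a hypothetical stand-in that just tags its prompt; a real chain makes a model call at each step, which is why map-reduce parallelizes (independent map calls) while refine is inherently sequential.

```python
# Sketch of the two long-context strategies. `llm` is a fake stand-in; a real
# chain calls a model at each step.
def llm(prompt: str) -> str:
    return f"summary<{prompt[:20]}...>"

def map_reduce(question: str, chunks: list) -> str:
    # Map: answer the question against each chunk independently (parallelizable).
    partials = [llm(f"{question} | {chunk}") for chunk in chunks]
    # Reduce: synthesize one final answer from the partial answers.
    return llm("combine: " + " ".join(partials))

def refine(question: str, chunks: list) -> str:
    # Refine: thread a single running answer through the chunks, one at a time.
    answer = ""
    for chunk in chunks:
        answer = llm(f"{question} | prior: {answer} | chunk: {chunk}")
    return answer

print(map_reduce("Summarize", ["chunk one text", "chunk two text"]))
print(refine("Summarize", ["chunk one text", "chunk two text"]))
```

Map-reduce costs one call per chunk plus one combine call but can lose cross-chunk nuance in the reduce step; refine preserves a running answer at the price of strictly serial latency.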