Ch 8 — Generation — Under the Hood
Context formatting, token budgets, chain internals, structured output, and faithfulness
A. Context Formatting & Token Budgets
How retrieved chunks become prompt context

1. Format Chunks: add metadata headers
2. Order Context: by relevance or source
3. Token Budget: fit within the context window
Token Math: system_prompt + context + question + max_output ≤ model_context_window
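The budgeting step above can be sketched as a greedy packing loop. This is a minimal illustration, not library code: `count_tokens` is a crude whitespace stand-in (a real system would use the model's tokenizer, e.g. tiktoken for OpenAI models), and `fit_chunks` is a hypothetical helper name.

```python
# Sketch: fit retrieved chunks into the token budget before prompt assembly.
# count_tokens is a crude stand-in (whitespace split); real code would use
# the model's own tokenizer.

def count_tokens(text: str) -> int:
    return len(text.split())

def fit_chunks(chunks, system_prompt, question, context_window, max_output):
    """Keep chunks (already ordered by relevance) until the budget runs out."""
    budget = (context_window - max_output
              - count_tokens(system_prompt) - count_tokens(question))
    kept, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > budget:
            break  # dropping the tail chunk keeps the prompt within the window
        kept.append(chunk)
        used += cost
    return kept

chunks = ["alpha " * 50, "beta " * 50, "gamma " * 50]  # 50 "tokens" each
kept = fit_chunks(chunks, "You are helpful.", "What is alpha?",
                  context_window=200, max_output=64)
print(len(kept))  # → 2: the third chunk would overflow the budget
```

Note the greedy cut-off: because chunks arrive ordered by relevance, truncating from the tail sacrifices the least useful context first.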
B. LangChain RAG Chain Internals
How create_retrieval_chain works under the hood

1. Input: question string
2. Retriever: get documents
3. Stuff Chain: prompt + LLM
4. Result Dict: answer + context
LCEL: Retriever | format_docs | prompt | llm | StrOutputParser — composable pipeline
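The `|` composition above works because LCEL runnables overload the pipe operator. The toy below is not LangChain's actual implementation; it is a minimal sketch of the mechanics, with hypothetical lambda stand-ins for the retriever, formatter, prompt, and LLM stages.

```python
# Toy model of LCEL-style composition: each stage wraps a function, and `|`
# builds a new Runnable that feeds one stage's output into the next.
class Runnable:
    def __init__(self, fn):
        self.fn = fn

    def invoke(self, value):
        return self.fn(value)

    def __or__(self, other):
        # self | other  ==  a Runnable that runs self, then other
        return Runnable(lambda value: other.invoke(self.invoke(value)))

# Hypothetical stand-ins for the real retriever / format_docs / prompt / llm.
retriever = Runnable(lambda q: [f"doc about {q}"])
format_docs = Runnable(lambda docs: "\n".join(docs))
prompt = Runnable(lambda ctx: f"Answer using:\n{ctx}")
llm = Runnable(lambda p: f"ANSWER({p!r})")

chain = retriever | format_docs | prompt | llm
print(chain.invoke("token budgets"))
```

One `invoke` at the head of the chain threads the question through every stage, which is why swapping a retriever or prompt never requires touching the rest of the pipeline.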
C. Structured Output & Citations
Forcing the LLM to return typed, parseable responses

1. Pydantic Schema: define the output type
2. Constrained Gen: JSON mode / tools
3. Typed Object: validated output
Citation Extraction: map each claim to its source chunk via an NLI or LLM grounding check
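In practice the schema step is usually a Pydantic model passed to the LLM via JSON mode or tool calling; the sketch below shows only the parse-and-validate half using stdlib dataclasses, with `CitedAnswer` and `parse_structured` as hypothetical names.

```python
import json
from dataclasses import dataclass, fields

# Hypothetical target schema: an answer plus the ids of supporting chunks.
@dataclass
class CitedAnswer:
    answer: str
    citations: list  # list of source-chunk ids

def parse_structured(raw: str) -> CitedAnswer:
    """Validate the LLM's JSON against the schema (Pydantic does this for real)."""
    data = json.loads(raw)
    expected = {f.name for f in fields(CitedAnswer)}
    if set(data) != expected:
        raise ValueError(f"keys {set(data)} do not match schema {expected}")
    if not isinstance(data["citations"], list):
        raise ValueError("citations must be a list")
    return CitedAnswer(**data)

raw = '{"answer": "RAG grounds generation in retrieved text.", "citations": [0, 2]}'
result = parse_structured(raw)
print(result.citations)  # → [0, 2]
```

Failing loudly on a malformed response is the point: a raised error can trigger a retry with the validation message appended to the prompt, rather than letting unparseable text leak downstream.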
D. Streaming & Async Patterns
Token-by-token delivery to the frontend

1. Backend: forwards token chunks to the frontend over SSE
Latency: TTFT ~200 ms (API), ~50 ms (local). Throughput: 50-100 tokens/sec (API), 30-80 tokens/sec (local)
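The forwarding loop can be sketched with an async generator. `fake_llm_stream` is a stand-in for the model's streaming iterator; the SSE framing (`data: ...` plus a blank line) is the real wire format an EventSource frontend expects.

```python
import asyncio

# Sketch of the backend loop: consume tokens from a (fake) model stream and
# frame each one as a Server-Sent Events message for the frontend.
async def fake_llm_stream(answer: str):
    for token in answer.split():
        await asyncio.sleep(0)  # real code awaits the model's next chunk here
        yield token

async def sse_events(answer: str):
    async for token in fake_llm_stream(answer):
        yield f"data: {token}\n\n"  # SSE frame: "data: ..." + blank line
    yield "data: [DONE]\n\n"        # sentinel so the client can close cleanly

async def main():
    return [event async for event in sse_events("Streaming keeps TTFT low")]

events = asyncio.run(main())
print(events[0])  # → "data: Streaming\n\n"
```

Because the first token is flushed as soon as the model emits it, perceived latency tracks TTFT rather than total generation time.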
E. Faithfulness Checking & Guardrails
Verifying the answer is grounded in the context

1. NLI Check: entailment scoring
2. Guardrails: NeMo Guardrails / custom
3. Safe Output: grounded answer
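The gate can be sketched as a per-sentence support check. A production version would score each sentence with an NLI model (context as premise, sentence as hypothesis); here token overlap stands in for the entailment score, and both function names are hypothetical.

```python
# Sketch of a grounding gate. Token overlap is a crude stand-in for an NLI
# entailment score; the thresholded all-sentences check is the real pattern.
def support_score(sentence: str, context: str) -> float:
    s = set(sentence.lower().split())
    c = set(context.lower().split())
    return len(s & c) / len(s) if s else 0.0

def is_grounded(answer: str, context: str, threshold: float = 0.6) -> bool:
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    return all(support_score(s, context) >= threshold for s in sentences)

context = "the retriever returns chunks and the llm answers from them"
print(is_grounded("the llm answers from chunks", context))   # → True
print(is_grounded("the llm was trained in 2019", context))   # → False
```

Answers that fail the gate are typically regenerated, rewritten to drop the unsupported sentence, or replaced with an "I don't know" fallback.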
F. Map-Reduce & Refine Chains
Handling context that exceeds the model window

1. Map: per-chunk answers
2. Reduce: synthesize the final answer
or
3. Refine: iteratively update a single running answer
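Both strategies above can be sketched in a few lines. `llm` is a hypothetical stand-in that just tags its prompt; a real chain makes a model call at each step, which is why map-reduce parallelizes (independent map calls) while refine is inherently sequential.

```python
# Sketch of the two long-context strategies. `llm` is a fake stand-in; a real
# chain calls a model at each step.
def llm(prompt: str) -> str:
    return f"summary<{prompt[:20]}...>"

def map_reduce(question: str, chunks: list) -> str:
    # Map: answer the question against each chunk independently (parallelizable).
    partials = [llm(f"{question} | {chunk}") for chunk in chunks]
    # Reduce: synthesize one final answer from the partial answers.
    return llm("combine: " + " ".join(partials))

def refine(question: str, chunks: list) -> str:
    # Refine: thread a single running answer through the chunks, one at a time.
    answer = ""
    for chunk in chunks:
        answer = llm(f"{question} | prior: {answer} | chunk: {chunk}")
    return answer

print(map_reduce("Summarize", ["chunk one text", "chunk two text"]))
print(refine("Summarize", ["chunk one text", "chunk two text"]))
```

Map-reduce costs one call per chunk plus one combine call but can lose cross-chunk nuance in the reduce step; refine preserves a running answer at the price of strictly serial latency.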