What It Contains
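Most RAG frameworks use a similar Document object with two fields. The field names below are LangChain's; other frameworks name them slightly differently: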
page_content — The text content. This is what gets embedded and searched. Should be clean, readable, and self-contained.
metadata — A dictionary of key-value pairs. Used for filtering, citation, and debugging. Not embedded (usually), but stored alongside the vector in the vector store.
# A well-structured Document
# Assumes LangChain's Document class; adjust the import for other frameworks.
from langchain_core.documents import Document

doc = Document(
    # The text that gets embedded and searched
    page_content="""Section 3.2: Refund Policy
Customers may request a full refund within
14 calendar days of purchase. After 14 days,
a 20% restocking fee applies. Digital products
are non-refundable once downloaded.
To request a refund, contact support@company.com
or call 1-800-555-0123.""",
    # Key-value pairs used for filtering, citation, and debugging
    metadata={
        "source": "policies/refunds.pdf",
        "page": 7,
        "section": "3.2",
        "title": "Refund Policy",
        "doc_type": "policy",
        "department": "customer_support",
        "last_updated": "2025-01-15",
    },
)
This Document flows through the entire pipeline: it is chunked (split into smaller Documents), embedded (converted to a vector), stored (in the vector database), and retrieved (returned to the LLM as context). Metadata attached at this stage is typically copied onto every chunk, so it stays available for filtering, citation, and debugging at each later stage.
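To make the chunking step concrete, here is a minimal sketch of how metadata can travel with each chunk. It does not use any framework: the Document dataclass is a hypothetical stand-in, and split_document, filter_by_metadata, the chunk_index key, and the max_chars value are all illustrative assumptions, not a real library's API.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical stand-in for a framework's Document class."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_document(doc: Document, max_chars: int = 120) -> list[Document]:
    """Greedily pack whole lines into chunks of at most max_chars characters.

    A single line longer than max_chars becomes its own oversized chunk.
    Each chunk gets a copy of the parent's metadata plus a chunk_index.
    """
    texts: list[str] = []
    current = ""
    for line in doc.page_content.splitlines():
        candidate = f"{current}\n{line}" if current else line
        if current and len(candidate) > max_chars:
            texts.append(current)  # current chunk is full; start a new one
            current = line
        else:
            current = candidate
    if current:
        texts.append(current)
    return [
        # {**doc.metadata} copies the dict, so the parent is not mutated
        Document(page_content=text, metadata={**doc.metadata, "chunk_index": i})
        for i, text in enumerate(texts)
    ]


def filter_by_metadata(docs: list[Document], **criteria) -> list[Document]:
    """Keep only Documents whose metadata matches every key=value given."""
    return [
        d for d in docs
        if all(d.metadata.get(key) == value for key, value in criteria.items())
    ]


doc = Document(
    page_content=(
        "Customers may request a full refund within\n"
        "14 calendar days of purchase. After 14 days,\n"
        "a 20% restocking fee applies."
    ),
    metadata={"source": "policies/refunds.pdf", "page": 7, "doc_type": "policy"},
)

chunks = split_document(doc, max_chars=60)
policy_chunks = filter_by_metadata(chunks, doc_type="policy")

# Each chunk can still be cited back to its source file and page.
for chunk in policy_chunks:
    print(f'{chunk.metadata["source"]} (p. {chunk.metadata["page"]}), '
          f'chunk {chunk.metadata["chunk_index"]}')
```

Because every chunk carries a copy of the parent's metadata, a retrieved chunk can be filtered by doc_type or cited by source and page without looking up the original Document. Real text splitters usually add overlap and smarter boundaries, which this sketch leaves out.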