Ch 2 — Document Loading & Preprocessing

Getting your data into the RAG pipeline
High Level
Raw Files → Parse → Sources → Clean → Metadata → Document → Ready
Garbage In, Garbage Out
Document loading is the most underestimated stage of RAG
The Problem
Your RAG system is only as good as the data it can read. A PDF that loads as garbled text, a web page full of navigation noise, or a spreadsheet with merged cells — all produce bad chunks, bad embeddings, and bad answers. Most RAG failures trace back to poor document loading.
What "Loading" Means
Document loading is the process of reading raw files (PDFs, HTML, DOCX, CSVs, etc.) and converting them into clean, structured text with metadata. The output is a list of Document objects — each containing page_content (the text) and metadata (source, page number, title, etc.).
```python
# The Document object — the universal unit
class Document:
    page_content: str  # the actual text
    metadata: dict     # source, page, title, ...

# Example output from a PDF loader:
Document(
    page_content="Section 3.2: Refund Policy\n\nCustomers may request a refund within 14 days of purchase...",
    metadata={
        "source": "policies/refunds.pdf",
        "page": 7,
        "title": "Company Policies 2025",
    },
)
```
Metadata is critical for RAG. It enables filtering ("only search HR docs"), citation ("this answer comes from page 7 of refunds.pdf"), and debugging ("why did the system retrieve this chunk?").
Common File Types and Their Challenges
Every format has its quirks
Structured Text
Markdown / Plain Text — Easiest to load. Text is already clean. Just read the file.

HTML — Need to strip navigation, headers, footers, ads, and scripts. Libraries like BeautifulSoup or trafilatura extract the main content.

JSON / CSV — Structured data. Each row or object can become a document. Key challenge: deciding which fields to include as text vs metadata.
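The text-vs-metadata split for tabular data can be sketched with the standard library. The `Document` dataclass below is a minimal stand-in for the framework object described earlier, and the choice of `description` as the text field is purely illustrative:

```python
import csv
import io
from dataclasses import dataclass, field

@dataclass
class Document:
    page_content: str
    metadata: dict = field(default_factory=dict)

def csv_to_documents(csv_text, text_field, source):
    """Turn each CSV row into a Document: one column becomes the
    searchable text, the remaining columns become metadata."""
    docs = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        text = row.pop(text_field)
        docs.append(Document(page_content=text,
                             metadata={"source": source, **row}))
    return docs

raw = "sku,name,description\n101,Widget,A small widget for tests\n"
docs = csv_to_documents(raw, text_field="description", source="products.csv")
# docs[0].page_content is the description; sku and name live in metadata
```

Keeping identifiers like `sku` in metadata rather than in the text keeps the embedding focused on meaning while still allowing exact-match filtering later.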
Complex Formats
PDF — The hardest and most common format. Text extraction varies wildly: some PDFs have selectable text, others are scanned images. Tables, multi-column layouts, headers/footers, and embedded images all cause problems.

DOCX / PPTX — Microsoft Office formats. XML-based internally. Need to handle styles, tables, embedded images, and speaker notes.

Scanned Documents — Require OCR (Optical Character Recognition) to convert images to text. Tesseract (open-source) or cloud OCR services handle this step.
PDFs are the #1 pain point in RAG. If your corpus is mostly PDFs, invest heavily in your PDF parsing pipeline. The difference between a good and bad PDF parser can be the difference between a working and broken RAG system.
Document Loaders
Libraries and tools that read files into Document objects
Framework Loaders
LangChain has 160+ document loaders — one for almost every format and source. PyPDFLoader, UnstructuredHTMLLoader, CSVLoader, GitLoader, NotionDBLoader, etc.

LlamaIndex has SimpleDirectoryReader that auto-detects file types, plus specialized readers for databases, APIs, and cloud storage via llama-hub.
Specialized Tools
Unstructured — Open-source library that handles PDFs, DOCX, HTML, images, and more. Detects document structure (titles, paragraphs, tables, lists). Used by both LangChain and LlamaIndex under the hood.

Docling (IBM) — Open-source document parser with advanced PDF understanding, table extraction, and layout analysis.
```python
# LangChain — load a PDF
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("policies/refunds.pdf")
docs = loader.load()  # → list of Document objects, one per page

# LlamaIndex — load a whole directory
from llama_index.core import SimpleDirectoryReader

reader = SimpleDirectoryReader("./data")
docs = reader.load_data()  # → auto-detects PDFs, DOCX, TXT, etc.

# Unstructured — structure-aware parsing
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf("report.pdf")
# → Title, NarrativeText, Table, ListItem, ...
```
Start with framework loaders, upgrade when needed. LangChain's PyPDFLoader or LlamaIndex's SimpleDirectoryReader work for most cases. When you hit quality issues (bad table extraction, garbled text), switch to Unstructured or Docling for that specific format.
Data Sources Beyond Files
Databases, APIs, and cloud services
Databases
SQL databases — Query rows and convert them to documents. Each row (or group of rows) becomes a Document with column values as text and table/column names as metadata.

NoSQL / Document DBs — MongoDB, Elasticsearch, etc. Each document or record maps naturally to a RAG Document object.
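The row-to-document mapping can be sketched with an in-memory SQLite table standing in for a real database; the `Document` dataclass again mimics the framework object:

```python
import sqlite3
from dataclasses import dataclass, field

@dataclass
class Document:
    page_content: str
    metadata: dict = field(default_factory=dict)

def table_to_documents(conn, table):
    """One Document per row: column names label the values in the
    text, and the table name goes into metadata for filtering."""
    cur = conn.execute(f"SELECT * FROM {table}")
    cols = [d[0] for d in cur.description]
    docs = []
    for row in cur:
        text = "\n".join(f"{c}: {v}" for c, v in zip(cols, row))
        docs.append(Document(page_content=text,
                             metadata={"source": f"db:{table}"}))
    return docs

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tickets (id INTEGER, status TEXT)")
conn.execute("INSERT INTO tickets VALUES (1, 'open')")
docs = table_to_documents(conn, "tickets")
```

The "Column: Value" layout keeps each row self-describing, so a retrieved chunk still makes sense to the LLM without the table schema.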
APIs and SaaS
Notion — Pages and databases via the Notion API
Confluence — Wiki pages and spaces
Google Drive — Docs, Sheets, Slides
Slack — Channel messages and threads
GitHub — Code, issues, PRs, READMEs

Both LangChain and LlamaIndex have loaders for all of these.
Web Scraping
Web pages — Crawl and extract content from websites. Tools like trafilatura, BeautifulSoup, or Firecrawl handle extraction. Key challenge: extracting the main content while stripping navigation, ads, and boilerplate.

Sitemaps — Use a sitemap to discover all pages on a site, then load each one. LangChain has SitemapLoader for this.
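The strip-the-boilerplate idea can be sketched with the standard library's `html.parser`; real pages are far messier and better served by trafilatura or BeautifulSoup, and the skip list below is an illustrative choice:

```python
from html.parser import HTMLParser

class MainTextExtractor(HTMLParser):
    """Collect visible text, skipping tags that usually hold
    boilerplate (navigation, scripts, headers, footers)."""
    SKIP = {"nav", "script", "style", "header", "footer", "aside"}

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many skipped tags we are inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

def extract_main_text(html):
    parser = MainTextExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)

page = ("<html><nav>Home | About</nav>"
        "<p>Refunds take 14 days.</p>"
        "<footer>© 2025</footer></html>")
text = extract_main_text(page)
```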
Think about freshness. Files are loaded once. But databases, APIs, and web pages change constantly. You need a strategy for re-indexing: scheduled (nightly), event-driven (webhook on change), or incremental (only re-index what changed). Chapter 11 covers this in detail.
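The incremental strategy can be sketched with content hashes, assuming you persist a hash per document between indexing runs (how `previous_hashes` is stored is up to you):

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def docs_to_reindex(current, previous_hashes):
    """Return the ids whose content changed since the last run.
    `current` maps doc id -> text; `previous_hashes` maps
    doc id -> hash recorded by the previous indexing run."""
    changed = []
    for doc_id, text in current.items():
        if previous_hashes.get(doc_id) != content_hash(text):
            changed.append(doc_id)
    return changed

prev = {"a": content_hash("old text"), "b": content_hash("same")}
now = {"a": "new text", "b": "same", "c": "brand new"}
stale = docs_to_reindex(now, prev)  # re-embed only these ids
```

Unchanged documents keep their existing vectors, so a nightly run only pays for what actually changed.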
Preprocessing & Cleaning
Turning raw text into clean, embeddable content
Common Cleaning Steps
Remove noise — Headers, footers, page numbers, watermarks, copyright notices. These repeat on every page and pollute embeddings.

Normalize whitespace — Collapse multiple newlines, fix encoding issues (curly quotes, em dashes, Unicode artifacts).

Remove boilerplate — Navigation menus, cookie banners, "click here to subscribe" text from web pages.

Handle tables — Convert tables to a text format the LLM can understand (Markdown tables, or "Column: Value" pairs).
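The cleaning steps above can be sketched as one small function; the patterns here (page-number lines, curly quotes, runs of blank lines) are illustrative and should be tuned to your corpus:

```python
import re

def clean_text(text):
    """Strip page-number lines, normalize typographic characters,
    and collapse runs of blank lines."""
    # drop lines that are only a page number, e.g. "Page 7" or "7"
    text = re.sub(r"(?m)^(Page\s+)?\d+$", "", text)
    # normalize curly quotes and apostrophes to plain ASCII
    text = (text.replace("\u201c", '"').replace("\u201d", '"')
                .replace("\u2018", "'").replace("\u2019", "'"))
    # collapse three or more newlines into a paragraph break
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

raw = "Refund policy\u2019s rules\n\n\n\nPage 7\n\nSee \u201cterms\u201d."
cleaned = clean_text(raw)
```

Run this on a few real documents and diff the input against the output; every rule should be justified by noise you have actually seen in the corpus.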
Metadata Enrichment
Add useful metadata during preprocessing:

Source tracking — File path, URL, database table
Timestamps — When the document was created/modified
Document structure — Section titles, heading hierarchy
Custom tags — Department, product, document type

This metadata enables filtered retrieval later: "search only in HR documents from 2025."
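Filtered retrieval then reduces to a predicate over the metadata dict. Shown here on plain dicts for clarity; production vector stores apply the same idea server-side before or alongside the similarity search:

```python
def filter_docs(docs, **criteria):
    """Keep only documents whose metadata matches every criterion,
    e.g. filter_docs(docs, department="hr", year=2025)."""
    return [d for d in docs
            if all(d["metadata"].get(k) == v
                   for k, v in criteria.items())]

docs = [
    {"page_content": "Vacation policy...",
     "metadata": {"department": "hr", "year": 2025}},
    {"page_content": "Q3 earnings...",
     "metadata": {"department": "finance", "year": 2025}},
]

hr_2025 = filter_docs(docs, department="hr", year=2025)
```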
Spend time here. A few hours improving your preprocessing pipeline often yields bigger RAG quality improvements than switching to a fancier embedding model or retrieval strategy. Clean data beats clever algorithms.
The Document Object
The universal data structure that flows through the pipeline
What It Contains
Every framework uses a similar Document object:

page_content — The text content. This is what gets embedded and searched. Should be clean, readable, and self-contained.

metadata — A dictionary of key-value pairs. Used for filtering, citation, and debugging. Not embedded (usually), but stored alongside the vector in the vector store.
```python
# A well-structured Document
Document(
    page_content="""Section 3.2: Refund Policy

Customers may request a full refund within 14 calendar days
of purchase. After 14 days, a 20% restocking fee applies.
Digital products are non-refundable once downloaded.

To request a refund, contact support@company.com
or call 1-800-555-0123.""",
    metadata={
        "source": "policies/refunds.pdf",
        "page": 7,
        "section": "3.2",
        "title": "Refund Policy",
        "doc_type": "policy",
        "department": "customer_support",
        "last_updated": "2025-01-15",
    },
)
```
This Document flows through the entire pipeline. It gets chunked (split into smaller Documents), embedded (vector added), stored (in the vector DB), and retrieved (returned to the LLM). Good metadata at this stage pays dividends at every later stage.
Loading Best Practices
Lessons learned from production RAG systems
Do
Inspect your raw output. Before chunking or embedding, look at what the loader actually produced. You'll catch encoding issues, missing text, and garbled tables early.

Preserve document structure. Keep section headings, list formatting, and paragraph breaks. They help the LLM understand context.

Track provenance. Always store the source file, page number, and any other metadata that helps you trace an answer back to its origin.

Test with your actual data. Don't assume a loader works well just because the docs say so. Load 10 representative documents and check the output manually.
Don't
Don't load everything blindly. Exclude irrelevant files (small-talk transcripts from meeting recordings, auto-generated reports, duplicate versions).

Don't ignore encoding. UTF-8 is the standard, but you'll encounter Latin-1, Windows-1252, and other encodings. Use chardet or charset-normalizer to detect and convert.

Don't skip tables. Tables often contain the most important information (pricing, specifications, policies). Convert them to a text format the LLM can parse.
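When chardet or charset-normalizer isn't available, a try-in-order fallback over the encodings named above can be sketched with the standard library:

```python
def read_text_any_encoding(data: bytes):
    """Try common encodings in order and return (text, encoding).
    A library like charset-normalizer does this far more robustly;
    this is just the fallback idea."""
    for enc in ("utf-8", "windows-1252", "latin-1"):
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # latin-1 accepts any byte, so we only get here if the list changes
    return data.decode("utf-8", errors="replace"), "utf-8?"

# A Windows-1252 file: invalid as UTF-8, so the fallback kicks in
text, enc = read_text_any_encoding("caf\u00e9".encode("windows-1252"))
```

Note the ordering matters: UTF-8 must come first, because latin-1 will happily (and wrongly) decode any byte sequence.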
Next up: Chunking. Once your documents are loaded and clean, the next step is splitting them into retrieval-friendly chunks. That's Chapter 3 — where chunk size, overlap, and splitting strategy determine how well your retrieval works.