Ch 2 — Document Loading & Preprocessing

Getting your data into the RAG pipeline
High Level
Raw Files → Parse → Sources → Clean → Metadata → Document → Ready
Garbage In, Garbage Out
Document loading is the most underestimated stage of RAG
The Problem
Your RAG system is only as good as the data it can read. A PDF that loads as garbled text, a web page full of navigation noise, or a spreadsheet with merged cells — all produce bad chunks, bad embeddings, and bad answers. Most RAG failures trace back to poor document loading.
What "Loading" Means
Document loading is the process of reading raw files (PDFs, HTML, DOCX, CSVs, etc.) and converting them into clean, structured text with metadata. The output is a list of Document objects — each containing page_content (the text) and metadata (source, page number, title, etc.).
```python
# The Document object — the universal unit
class Document:
    page_content: str  # the actual text
    metadata: dict     # source, page, title, ...

# Example output from a PDF loader:
Document(
    page_content="Section 3.2: Refund Policy\n\nCustomers may request a refund within 14 days of purchase...",
    metadata={
        "source": "policies/refunds.pdf",
        "page": 7,
        "title": "Company Policies 2025",
    },
)
```
Metadata is critical for RAG. It enables filtering ("only search HR docs"), citation ("this answer comes from page 7 of refunds.pdf"), and debugging ("why did the system retrieve this chunk?").
Common File Types and Their Challenges
Every format has its quirks
Structured Text
Markdown / Plain Text — Easiest to load. Text is already clean. Just read the file.

HTML — Need to strip navigation, headers, footers, ads, and scripts. Libraries like BeautifulSoup or trafilatura extract the main content.

JSON / CSV — Structured data. Each row or object can become a document. Key challenge: deciding which fields to include as text vs metadata.
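The text-vs-metadata split for tabular data can be sketched with the standard library. The `Document` dataclass below is a minimal stand-in for the framework object described earlier, and the choice of `description` as the text field is purely illustrative:

```python
import csv
import io
from dataclasses import dataclass, field

@dataclass
class Document:
    page_content: str
    metadata: dict = field(default_factory=dict)

def csv_to_documents(csv_text, text_field, source):
    """Turn each CSV row into a Document: one column becomes the
    searchable text, the remaining columns become metadata."""
    docs = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        text = row.pop(text_field)
        docs.append(Document(page_content=text,
                             metadata={"source": source, **row}))
    return docs

raw = "sku,name,description\n101,Widget,A small widget for tests\n"
docs = csv_to_documents(raw, text_field="description", source="products.csv")
# docs[0].page_content is the description; sku and name live in metadata
```

Keeping identifiers like `sku` in metadata rather than in the text keeps the embedding focused on meaning while still allowing exact-match filtering later.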
Complex Formats
PDF — The hardest and most common format. Text extraction varies wildly: some PDFs have selectable text, others are scanned images. Tables, multi-column layouts, headers/footers, and embedded images all cause problems.

DOCX / PPTX — Microsoft Office formats. XML-based internally. Need to handle styles, tables, embedded images, and speaker notes.

Scanned Documents — Require OCR (Optical Character Recognition) to convert images to text. Tesseract (open-source) or cloud OCR services handle this step.
PDFs are the #1 pain point in RAG. If your corpus is mostly PDFs, invest heavily in your PDF parsing pipeline. The difference between a good and bad PDF parser can be the difference between a working and broken RAG system.
Document Loaders
Libraries and tools that read files into Document objects
Framework Loaders
LangChain has 160+ document loaders — one for almost every format and source. PyPDFLoader, UnstructuredHTMLLoader, CSVLoader, GitLoader, NotionDBLoader, etc.

LlamaIndex has SimpleDirectoryReader that auto-detects file types, plus specialized readers for databases, APIs, and cloud storage via llama-hub.
Specialized Tools
Unstructured — Open-source library that handles PDFs, DOCX, HTML, images, and more. Detects document structure (titles, paragraphs, tables, lists). Used by both LangChain and LlamaIndex under the hood.

Docling (IBM) — Open-source document parser with advanced PDF understanding, table extraction, and layout analysis.
```python
# LangChain — load a PDF
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("policies/refunds.pdf")
docs = loader.load()  # → list of Document objects, one per page

# LlamaIndex — load a whole directory
from llama_index.core import SimpleDirectoryReader

reader = SimpleDirectoryReader("./data")
docs = reader.load_data()  # → auto-detects PDFs, DOCX, TXT, etc.

# Unstructured — structure-aware parsing
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf("report.pdf")
# → Title, NarrativeText, Table, ListItem, ...
```
Start with framework loaders, upgrade when needed. LangChain's PyPDFLoader or LlamaIndex's SimpleDirectoryReader work for most cases. When you hit quality issues (bad table extraction, garbled text), switch to Unstructured or Docling for that specific format.
Data Sources Beyond Files
Databases, APIs, and cloud services
Databases
SQL databases — Query rows and convert them to documents. Each row (or group of rows) becomes a Document with column values as text and table/column names as metadata.

NoSQL / Document DBs — MongoDB, Elasticsearch, etc. Each document or record maps naturally to a RAG Document object.
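The row-to-document mapping can be sketched with an in-memory SQLite table standing in for a real database; the `Document` dataclass again mimics the framework object:

```python
import sqlite3
from dataclasses import dataclass, field

@dataclass
class Document:
    page_content: str
    metadata: dict = field(default_factory=dict)

def table_to_documents(conn, table):
    """One Document per row: column names label the values in the
    text, and the table name goes into metadata for filtering."""
    cur = conn.execute(f"SELECT * FROM {table}")
    cols = [d[0] for d in cur.description]
    docs = []
    for row in cur:
        text = "\n".join(f"{c}: {v}" for c, v in zip(cols, row))
        docs.append(Document(page_content=text,
                             metadata={"source": f"db:{table}"}))
    return docs

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tickets (id INTEGER, status TEXT)")
conn.execute("INSERT INTO tickets VALUES (1, 'open')")
docs = table_to_documents(conn, "tickets")
```

The "Column: Value" layout keeps each row self-describing, so a retrieved chunk still makes sense to the LLM without the table schema.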
APIs and SaaS
Notion — Pages and databases via the Notion API
Confluence — Wiki pages and spaces
Google Drive — Docs, Sheets, Slides
Slack — Channel messages and threads
GitHub — Code, issues, PRs, READMEs

Both LangChain and LlamaIndex have loaders for all of these.
Web Scraping
Web pages — Crawl and extract content from websites. Tools like trafilatura, BeautifulSoup, or Firecrawl handle extraction. Key challenge: extracting the main content while stripping navigation, ads, and boilerplate.

Sitemaps — Use a sitemap to discover all pages on a site, then load each one. LangChain has SitemapLoader for this.
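The strip-the-boilerplate idea can be sketched with the standard library's `html.parser`; real pages are far messier and better served by trafilatura or BeautifulSoup, and the skip list below is an illustrative choice:

```python
from html.parser import HTMLParser

class MainTextExtractor(HTMLParser):
    """Collect visible text, skipping tags that usually hold
    boilerplate (navigation, scripts, headers, footers)."""
    SKIP = {"nav", "script", "style", "header", "footer", "aside"}

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many skipped tags we are inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

def extract_main_text(html):
    parser = MainTextExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)

page = ("<html><nav>Home | About</nav>"
        "<p>Refunds take 14 days.</p>"
        "<footer>© 2025</footer></html>")
text = extract_main_text(page)
```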
Think about freshness. Files are loaded once. But databases, APIs, and web pages change constantly. You need a strategy for re-indexing: scheduled (nightly), event-driven (webhook on change), or incremental (only re-index what changed). Chapter 11 covers this in detail.
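The incremental strategy can be sketched with content hashes, assuming you persist a hash per document between indexing runs (how `previous_hashes` is stored is up to you):

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def docs_to_reindex(current, previous_hashes):
    """Return the ids whose content changed since the last run.
    `current` maps doc id -> text; `previous_hashes` maps
    doc id -> hash recorded by the previous indexing run."""
    changed = []
    for doc_id, text in current.items():
        if previous_hashes.get(doc_id) != content_hash(text):
            changed.append(doc_id)
    return changed

prev = {"a": content_hash("old text"), "b": content_hash("same")}
now = {"a": "new text", "b": "same", "c": "brand new"}
stale = docs_to_reindex(now, prev)  # re-embed only these ids
```

Unchanged documents keep their existing vectors, so a nightly run only pays for what actually changed.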
Preprocessing & Cleaning
Turning raw text into clean, embeddable content
Common Cleaning Steps
Remove noise — Headers, footers, page numbers, watermarks, copyright notices. These repeat on every page and pollute embeddings.

Normalize whitespace — Collapse multiple newlines, fix encoding issues (curly quotes, em dashes, Unicode artifacts).

Remove boilerplate — Navigation menus, cookie banners, "click here to subscribe" text from web pages.

Handle tables — Convert tables to a text format the LLM can understand (Markdown tables, or "Column: Value" pairs).
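The cleaning steps above can be sketched as one small function; the patterns here (page-number lines, curly quotes, runs of blank lines) are illustrative and should be tuned to your corpus:

```python
import re

def clean_text(text):
    """Strip page-number lines, normalize typographic characters,
    and collapse runs of blank lines."""
    # drop lines that are only a page number, e.g. "Page 7" or "7"
    text = re.sub(r"(?m)^(Page\s+)?\d+$", "", text)
    # normalize curly quotes and apostrophes to plain ASCII
    text = (text.replace("\u201c", '"').replace("\u201d", '"')
                .replace("\u2018", "'").replace("\u2019", "'"))
    # collapse three or more newlines into a paragraph break
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

raw = "Refund policy\u2019s rules\n\n\n\nPage 7\n\nSee \u201cterms\u201d."
cleaned = clean_text(raw)
```

Run this on a few real documents and diff the input against the output; every rule should be justified by noise you have actually seen in the corpus.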
Metadata Enrichment
Add useful metadata during preprocessing:

Source tracking — File path, URL, database table
Timestamps — When the document was created/modified
Document structure — Section titles, heading hierarchy
Custom tags — Department, product, document type

This metadata enables filtered retrieval later: "search only in HR documents from 2025."
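Filtered retrieval then reduces to a predicate over the metadata dict. Shown here on plain dicts for clarity; production vector stores apply the same idea server-side before or alongside the similarity search:

```python
def filter_docs(docs, **criteria):
    """Keep only documents whose metadata matches every criterion,
    e.g. filter_docs(docs, department="hr", year=2025)."""
    return [d for d in docs
            if all(d["metadata"].get(k) == v
                   for k, v in criteria.items())]

docs = [
    {"page_content": "Vacation policy...",
     "metadata": {"department": "hr", "year": 2025}},
    {"page_content": "Q3 earnings...",
     "metadata": {"department": "finance", "year": 2025}},
]

hr_2025 = filter_docs(docs, department="hr", year=2025)
```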
Spend time here. A few hours improving your preprocessing pipeline often yields bigger RAG quality improvements than switching to a fancier embedding model or retrieval strategy. Clean data beats clever algorithms.
The Document Object
The universal data structure that flows through the pipeline
What It Contains
Every framework uses a similar Document object:

page_content — The text content. This is what gets embedded and searched. Should be clean, readable, and self-contained.

metadata — A dictionary of key-value pairs. Used for filtering, citation, and debugging. Not embedded (usually), but stored alongside the vector in the vector store.
```python
# A well-structured Document
Document(
    page_content="""Section 3.2: Refund Policy

Customers may request a full refund within 14 calendar days
of purchase. After 14 days, a 20% restocking fee applies.
Digital products are non-refundable once downloaded.

To request a refund, contact support@company.com
or call 1-800-555-0123.""",
    metadata={
        "source": "policies/refunds.pdf",
        "page": 7,
        "section": "3.2",
        "title": "Refund Policy",
        "doc_type": "policy",
        "department": "customer_support",
        "last_updated": "2025-01-15",
    },
)
```
This Document flows through the entire pipeline. It gets chunked (split into smaller Documents), embedded (vector added), stored (in the vector DB), and retrieved (returned to the LLM). Good metadata at this stage pays dividends at every later stage.
Loading Best Practices
Lessons learned from production RAG systems
Do
Inspect your raw output. Before chunking or embedding, look at what the loader actually produced. You'll catch encoding issues, missing text, and garbled tables early.

Preserve document structure. Keep section headings, list formatting, and paragraph breaks. They help the LLM understand context.

Track provenance. Always store the source file, page number, and any other metadata that helps you trace an answer back to its origin.

Test with your actual data. Don't assume a loader works well just because the docs say so. Load 10 representative documents and check the output manually.
Don't
Don't load everything blindly. Exclude irrelevant files (small-talk transcripts from meeting recordings, auto-generated reports, duplicate versions).

Don't ignore encoding. UTF-8 is the standard, but you'll encounter Latin-1, Windows-1252, and other encodings. Use chardet or charset-normalizer to detect and convert.

Don't skip tables. Tables often contain the most important information (pricing, specifications, policies). Convert them to a text format the LLM can parse.
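When chardet or charset-normalizer isn't available, a try-in-order fallback over the encodings named above can be sketched with the standard library:

```python
def read_text_any_encoding(data: bytes):
    """Try common encodings in order and return (text, encoding).
    A library like charset-normalizer does this far more robustly;
    this is just the fallback idea."""
    for enc in ("utf-8", "windows-1252", "latin-1"):
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # latin-1 accepts any byte, so we only get here if the list changes
    return data.decode("utf-8", errors="replace"), "utf-8?"

# A Windows-1252 file: invalid as UTF-8, so the fallback kicks in
text, enc = read_text_any_encoding("caf\u00e9".encode("windows-1252"))
```

Note the ordering matters: UTF-8 must come first, because latin-1 will happily (and wrongly) decode any byte sequence.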
Next up: Chunking. Once your documents are loaded and clean, the next step is splitting them into retrieval-friendly chunks. That's Chapter 3 — where chunk size, overlap, and splitting strategy determine how well your retrieval works.