Ch 2: Document Loading & Preprocessing

Ch 2 — Document Loading — Under the Hood

PDF parsing internals, Unstructured pipeline, encoding, and table extraction


A. PDF Parsing Internals: The hardest problem in document loading

PDF File: Binary format, page-based layout

parse

Parser: PyMuPDF, pdfplumber, PyPDF2, PDFMiner

extract

Raw Text: Text blocks + coordinates
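
The parse and extract steps can be sketched as follows, assuming PyMuPDF is the chosen parser. `extract_blocks` relies on `page.get_text("blocks")`, which returns text blocks with their bounding boxes. `reading_order` is a hypothetical helper, not part of any library: it sorts blocks top to bottom and then left to right, which is only a reasonable guess for single-column pages.

```python
from typing import List, Tuple

# (x0, y0, x1, y1, text): a text block and its bounding box on the page
Block = Tuple[float, float, float, float, str]


def extract_blocks(path: str) -> List[List[Block]]:
    """Return one list of text blocks per page, using PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so reading_order() works without it

    pages = []
    with fitz.open(path) as doc:
        for page in doc:
            # Each item is (x0, y0, x1, y1, text, block_no, block_type).
            raw = page.get_text("blocks")
            # block_type 0 is a text block; 1 is an image block.
            pages.append([(b[0], b[1], b[2], b[3], b[4]) for b in raw if b[6] == 0])
    return pages


def reading_order(blocks: List[Block], line_tolerance: float = 3.0) -> List[str]:
    """Sort blocks top-to-bottom, then left-to-right (single-column assumption)."""
    # Bucket y so blocks on roughly the same visual line are ordered by x.
    # This is a crude heuristic: blocks near a bucket boundary can still split.
    ordered = sorted(blocks, key=lambda b: (int(b[1] // line_tolerance), b[0]))
    return [b[4].strip() for b in ordered]
```

Keeping the coordinates next to the text is the point of the sketch: a PDF stores positioned glyphs rather than paragraphs, so any reading order has to be reconstructed from the layout.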

↓ Structure-aware parsing with Unstructured

B. Unstructured Library Pipeline: Detecting document structure automatically

partition(): Auto-detect format, route to parser

detects

Elements: Title, NarrativeText, Table, ListItem

enrich

Metadata: Coordinates, page, section, category
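
A minimal sketch of this pipeline, assuming the `unstructured` package is installed. `load_elements` calls `partition(filename=...)` and keeps each element's category and text. `group_by_title` is a hypothetical helper that shows one way to use the detected categories: it attaches every non-title element to the most recent `Title`.

```python
from typing import Dict, Iterable, List, Tuple


def load_elements(path: str) -> List[Tuple[str, str]]:
    """Partition a file with Unstructured and return (category, text) pairs."""
    from unstructured.partition.auto import partition  # lazy third-party import

    return [(el.category, el.text) for el in partition(filename=path)]


def group_by_title(elements: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Attach each non-title element to the most recent Title element."""
    sections: Dict[str, List[str]] = {}
    current = "(no title)"  # hypothetical bucket for text before the first Title
    for category, text in elements:
        if category == "Title":
            current = text
            sections.setdefault(current, [])
        else:
            sections.setdefault(current, []).append(text)
    return sections
```

Grouping by title is only one possible use of the element categories; the same loop could instead route `Table` elements to a separate table-handling step.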

↓ Table extraction strategies

C. Table Extraction: The second hardest problem in document loading

PDF Table: No actual table structure in PDF

detect

Table Detection: pdfplumber, Camelot, or vision models

convert

Text Format: Markdown table or key-value pairs
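
The detect and convert steps might look like the sketch below, assuming pdfplumber does the detection. `page.extract_tables()` returns each table as a list of rows, with `None` for empty cells. `rows_to_markdown` is a hypothetical helper that treats the first row as the header, which will not hold for every table.

```python
from typing import List, Optional, Sequence

Row = Sequence[Optional[str]]


def extract_tables(path: str) -> List[List[Row]]:
    """Collect every table pdfplumber detects, across all pages."""
    import pdfplumber  # lazy third-party import

    tables: List[List[Row]] = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            tables.extend(page.extract_tables())
    return tables


def rows_to_markdown(rows: Sequence[Row]) -> str:
    """Render rows as a Markdown table, treating the first row as the header."""
    if not rows:
        return ""
    width = max(len(r) for r in rows)

    def clean(cell: Optional[str]) -> str:
        # None means an empty cell; newlines and pipes would break the syntax.
        return (cell or "").replace("\n", " ").replace("|", "\\|").strip()

    def line(row: Row) -> str:
        # Pad short rows so every line has the same number of columns.
        cells = [clean(c) for c in row] + [""] * (width - len(row))
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    separator = "| " + " | ".join(["---"] * width) + " |"
    return "\n".join([line(header), separator, *(line(r) for r in body)])
```

For example, `rows_to_markdown([["Name", "Qty"], ["Widget", "2"]])` produces a three-line Markdown table with a header, a separator, and one data row.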

↓ HTML extraction and web scraping

D. HTML Extraction & Web Scraping: Getting clean text from web pages

Raw HTML: Nav, ads, scripts, cookie banners

extract

Content Extract: trafilatura, BeautifulSoup

output

Clean Text: Main content + metadata
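
To illustrate what the extract step removes, here is a deliberately simple sketch built on the standard library's `html.parser` rather than on trafilatura or BeautifulSoup. It is a stand-in technique: it only drops text inside a fixed set of tags (`SKIP_TAGS`, an assumption of this sketch) and does no main-content scoring. `extract_main_text` shows the corresponding trafilatura call, which can return `None` when it finds no main content.

```python
from html.parser import HTMLParser
from typing import List, Optional

# Tags whose text is treated as boilerplate (an assumption of this sketch).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "noscript"}


class _TextCollector(HTMLParser):
    """Collect text that is not nested inside any of SKIP_TAGS."""

    def __init__(self) -> None:
        super().__init__()
        self.skip_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def strip_boilerplate(html: str) -> str:
    """Return visible text outside boilerplate tags, one chunk per line."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def extract_main_text(html: str) -> Optional[str]:
    """Library route: let trafilatura pick out the main content."""
    import trafilatura  # lazy third-party import

    return trafilatura.extract(html)
```

The tag filter misses boilerplate that lives in ordinary `div` elements, such as most cookie banners and ads, and it splits text at inline tags. That gap is the reason content-extraction libraries exist.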

↓ OCR for scanned documents

E. OCR & Scanned Documents: When PDFs are just images

Scanned PDF: Image-only pages, no selectable text

OCR

Tesseract: Open-source OCR, 100+ languages

Cloud OCR: AWS Textract, Google Document AI
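
A rough sketch of the Tesseract route, assuming the `pytesseract` and `pdf2image` packages plus the Tesseract and Poppler binaries they wrap. `needs_ocr` is a hypothetical heuristic, and its `min_chars` threshold is an arbitrary choice: a page whose text layer is nearly empty is treated as a scan.

```python
from typing import List


def needs_ocr(extracted_text: str, min_chars: int = 20) -> bool:
    """Heuristic: a page with almost no extractable text is probably a scan."""
    # Ignore whitespace so a page of blank lines still counts as empty.
    return len("".join(extracted_text.split())) < min_chars


def ocr_pdf(path: str, lang: str = "eng", dpi: int = 300) -> List[str]:
    """Render each page to an image and run Tesseract on it."""
    from pdf2image import convert_from_path  # requires Poppler on the system
    import pytesseract  # requires the Tesseract binary on the system

    images = convert_from_path(path, dpi=dpi)
    return [pytesseract.image_to_string(img, lang=lang) for img in images]
```

Checking the text layer first keeps OCR, which is slower and less accurate than direct extraction, for the pages that actually need it.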

↓ Encoding detection and normalization

F. Encoding & Normalization: Making sure text is actually text

Detect Encoding: charset-normalizer, chardet

convert

UTF-8: Standard encoding for all text

normalize

Clean Text: Unicode NFC, consistent format
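
The detect, convert, and normalize steps can be sketched with the standard library plus an optional detector. This version tries UTF-8 first and consults charset-normalizer only when that fails. The final cp1252 fallback and the newline cleanup are assumptions of this sketch, not part of the pipeline described above.

```python
import unicodedata


def decode_bytes(raw: bytes) -> str:
    """Decode bytes to str: try UTF-8 first, then fall back to detection."""
    try:
        # utf-8-sig also strips a leading byte-order mark if one is present.
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        pass
    try:
        from charset_normalizer import from_bytes  # optional third-party

        best = from_bytes(raw).best()
        if best is not None:
            return str(best)
    except ImportError:
        pass
    # Last resort (an assumption of this sketch): cp1252 with replacement.
    return raw.decode("cp1252", errors="replace")


def clean_text(raw: bytes) -> str:
    """Decode, unify line endings, and normalize to Unicode NFC."""
    text = decode_bytes(raw)
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return unicodedata.normalize("NFC", text)
```

NFC matters for retrieval because "é" can be stored either as one code point or as "e" plus a combining accent. Without normalization, the two forms look identical on screen but compare as different strings.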