Ch 2 — Document Loading — Under the Hood
PDF parsing internals, Unstructured pipeline, encoding, and table extraction
A. PDF Parsing Internals
The hardest problem in document loading
PDF File: binary format, page-based layout
  parse ->
Parser: PyMuPDF, pdfplumber, PyPDF2, PDFMiner
  extract ->
Raw Text: text blocks + coordinates
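
The parse and extract steps above can be sketched in a few lines. This is a minimal illustration, assuming PyMuPDF is installed; the function names `load_text_blocks` and `blocks_to_text` and the `line_tolerance` parameter are hypothetical, and the reading-order heuristic is deliberately naive (it will misorder multi-column pages).

```python
from typing import List, Tuple

# (x0, y0, x1, y1, text): a text block and its bounding box in page coordinates.
Block = Tuple[float, float, float, float, str]


def load_text_blocks(path: str) -> List[List[Block]]:
    """Return one list of text blocks per page, using PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so blocks_to_text works without it

    pages = []
    with fitz.open(path) as doc:
        for page in doc:
            blocks = []
            # get_text("blocks") yields (x0, y0, x1, y1, text, block_no, block_type)
            for x0, y0, x1, y1, text, _no, block_type in page.get_text("blocks"):
                if block_type == 0:  # 0 = text block, 1 = image block
                    blocks.append((x0, y0, x1, y1, text.strip()))
            pages.append(blocks)
    return pages


def blocks_to_text(blocks: List[Block], line_tolerance: float = 3.0) -> str:
    """Join blocks top-to-bottom, then left-to-right (a naive reading order)."""
    # Blocks whose top edges fall in the same vertical bucket count as one row.
    ordered = sorted(blocks, key=lambda b: (int(b[1] // line_tolerance), b[0]))
    return "\n".join(b[4] for b in ordered if b[4])
```

Keeping the coordinates next to the text is the point of the sketch: the raw string alone loses the layout that later steps (table detection, section metadata) depend on.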
↓ Structure-aware parsing with Unstructured
B. Unstructured Library Pipeline
Detecting document structure automatically
partition(): auto-detect format, route to parser
  detects ->
Elements: Title, NarrativeText, Table, ListItem
  enrich ->
Metadata: coordinates, page, section, category
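
A short sketch of this pipeline, assuming the `unstructured` package is installed. The helpers `partition_file`, `summarize`, and `group_by_category` are hypothetical names introduced here; only `partition(filename=...)` and the element attributes `category`, `text`, and `metadata.page_number` come from the library.

```python
from collections import defaultdict


def partition_file(path):
    """Let Unstructured detect the file type and route it to a parser."""
    from unstructured.partition.auto import partition  # lazy import

    return partition(filename=path)


def summarize(elements):
    """Flatten elements into plain dicts: category, text, and page number."""
    rows = []
    for el in elements:
        meta = getattr(el, "metadata", None)
        rows.append(
            {
                "category": el.category,  # e.g. "Title", "NarrativeText"
                "text": el.text,
                "page": getattr(meta, "page_number", None),
            }
        )
    return rows


def group_by_category(rows):
    """Collect element texts under their category label."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["category"]].append(row["text"])
    return dict(grouped)
```

A typical call would be `group_by_category(summarize(partition_file("report.pdf")))`, where `report.pdf` is a placeholder path.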
↓ Table extraction strategies
C. Table Extraction
The second hardest problem in document loading
PDF Table: no actual table structure in PDF
  detect ->
Table Detection: pdfplumber, Camelot, or vision models
  convert ->
Text Format: Markdown table or key-value pairs
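
The detect and convert steps might look like the following, assuming pdfplumber is installed and that the first detected row is the header (an assumption that often fails on real documents). `extract_tables_from_page`, `rows_to_markdown`, and `rows_to_key_values` are hypothetical helper names.

```python
def extract_tables_from_page(path, page_number=0):
    """Detect tables on one page; each table is a list of rows of str or None."""
    import pdfplumber  # lazy import so the converters work without it

    with pdfplumber.open(path) as pdf:
        return pdf.pages[page_number].extract_tables()


def _cell(value):
    """Render one cell safely for a Markdown table."""
    if value is None:
        return ""
    return str(value).replace("\n", " ").replace("|", "\\|").strip()


def rows_to_markdown(rows):
    """Convert rows to a Markdown table, treating the first row as the header."""
    if not rows:
        return ""
    header, *body = rows
    width = len(header)
    lines = [
        "| " + " | ".join(_cell(c) for c in header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in body:
        # Pad or trim ragged rows so every line has the same column count.
        padded = list(row)[:width] + [None] * (width - len(row))
        lines.append("| " + " | ".join(_cell(c) for c in padded) + " |")
    return "\n".join(lines)


def rows_to_key_values(rows):
    """Convert rows to one dict per data row, keyed by the header cells."""
    if not rows:
        return []
    header, *body = rows
    keys = [_cell(c) for c in header]
    return [dict(zip(keys, (_cell(c) for c in row))) for row in body]
```

Markdown keeps the grid readable for a language model; key-value pairs tend to hold up better when rows are later split across chunks, since each record carries its own column names.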
↓ HTML extraction and web scraping
D. HTML Extraction & Web Scraping
Getting clean text from web pages
Raw HTML: nav, ads, scripts, cookie banners
  extract ->
Content Extract: trafilatura, BeautifulSoup
  output ->
Clean Text: main content + metadata
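
As a rough sketch of the extract step: the code below tries `trafilatura.extract` when trafilatura is installed and otherwise falls back to a crude tag filter built on the standard library. The fallback is a swapped-in stand-in for illustration, not how trafilatura or BeautifulSoup work internally, and the names `strip_boilerplate`, `extract_main_text`, and `SKIP_TAGS` are hypothetical.

```python
from html.parser import HTMLParser

# Containers whose contents are usually page chrome rather than main content.
SKIP_TAGS = {"script", "style", "noscript", "nav", "header", "footer", "aside"}


class _TextCollector(HTMLParser):
    """Collect text that is not nested inside any tag listed in SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.depth = 0  # how many skipped containers we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())


def strip_boilerplate(html):
    """Stdlib-only fallback: drop scripts, nav, and similar containers."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def extract_main_text(html):
    """Prefer trafilatura when available; otherwise use the crude fallback."""
    try:
        import trafilatura

        text = trafilatura.extract(html)  # returns None if nothing is found
        if text:
            return text
    except ImportError:
        pass
    return strip_boilerplate(html)
```

The tag filter cannot recognize ads or cookie banners that live in ordinary `div` elements, which is the main reason to reach for a dedicated extractor in practice.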
↓ OCR for scanned documents
E. OCR & Scanned Documents
When PDFs are just images
Scanned PDF: image-only pages, no selectable text
  OCR ->
Tesseract: open-source OCR, 100+ languages
  or
Cloud OCR: AWS Textract, Google Document AI
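
One way to wire up the Tesseract path, assuming `pytesseract` and `pdf2image` are installed along with the Tesseract and Poppler binaries they call. The `needs_ocr` check and its `min_chars` threshold are hypothetical: a simple heuristic for spotting pages with no selectable text, not a rule from any library.

```python
def needs_ocr(page_text, min_chars=20):
    """Guess whether a page is image-only from how little text was extracted."""
    return len(page_text.strip()) < min_chars


def ocr_pdf(path, lang="eng", dpi=300):
    """Rasterize each page and run Tesseract on it; returns one string per page."""
    from pdf2image import convert_from_path  # requires Poppler
    import pytesseract  # requires the Tesseract binary

    images = convert_from_path(path, dpi=dpi)
    return [pytesseract.image_to_string(image, lang=lang) for image in images]
```

Running the cheap text extraction first and calling `ocr_pdf` only for pages where `needs_ocr` returns True avoids paying the OCR cost on PDFs that already contain selectable text.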
↓ Encoding detection and normalization
F. Encoding & Normalization
Making sure text is actually text
Detect Encoding: charset-normalizer, chardet
  convert ->
UTF-8: standard encoding for all text
  normalize ->
Clean Text: Unicode NFC, consistent format
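
The detect, convert, and normalize steps can be sketched as below. Detection uses charset-normalizer when it is installed and otherwise assumes UTF-8 with replacement characters; that fallback, the line-ending cleanup, and the function names `decode_bytes` and `to_clean_text` are assumptions added for illustration.

```python
import unicodedata


def decode_bytes(raw, encoding=None):
    """Decode bytes to str, detecting the encoding when none is given."""
    if encoding is not None:
        return raw.decode(encoding)
    try:
        from charset_normalizer import from_bytes

        best = from_bytes(raw).best()  # None when no plausible encoding is found
        if best is not None:
            return str(best)  # str() of a match is the decoded text
    except ImportError:
        pass
    # Fallback assumption: treat the input as UTF-8 and mark undecodable bytes.
    return raw.decode("utf-8", errors="replace")


def to_clean_text(raw, encoding=None):
    """Decode, unify line endings, and apply Unicode NFC normalization."""
    text = decode_bytes(raw, encoding)
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return unicodedata.normalize("NFC", text)
```

NFC matters because the same visible character can arrive as one code point or as a base letter plus a combining mark; without normalization, two copies of "café" may fail an equality check or produce different tokens.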