Ch 2 — How Machines See: From Pixels to Patches

Pixels, CNNs, Vision Transformers, image patching, and how images become tokens for LLMs
High Level

Pixels → CNN → Features → Patches → Tokens → LLM
Pixels: The Raw Material
What a computer actually “sees”
What Is a Pixel
A pixel is a single point of color, represented as three numbers (R, G, B) ranging from 0–255. A 1024×1024 image has 1,048,576 pixels × 3 channels = 3,145,728 values. This is far too many dimensions for a model to process directly — we need a way to compress images into meaningful representations.
The Resolution Problem
// Raw pixel counts by resolution (width × height × 3 channels)
224×224       150,528 values     (ImageNet standard)
512×512       786,432 values     (Stable Diffusion)
1024×1024     3,145,728 values   (high-res generation)
3840×2160     24,883,200 values  (4K image)
// Processing this directly would require
// enormous compute and memory
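The counts above follow from simple arithmetic. A minimal sketch (the `raw_values` helper is illustrative, not from any library):

```python
# Raw values a model would have to ingest, per resolution.
def raw_values(width: int, height: int, channels: int = 3) -> int:
    """Total numbers in a raw image: width * height * channels."""
    return width * height * channels

for w, h, label in [(224, 224, "ImageNet standard"),
                    (512, 512, "Stable Diffusion"),
                    (1024, 1024, "high-res generation"),
                    (3840, 2160, "4K image")]:
    print(f"{w}x{h}: {raw_values(w, h):>10,} values  ({label})")
```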
Color Spaces
RGB: Red, Green, Blue (most common, 3 channels)
Grayscale: Single channel, 0–255
HSV: Hue, Saturation, Value (useful for color-based tasks)
RGBA: RGB + Alpha transparency channel

Most vision models work with RGB normalized to 0.0–1.0, often with mean subtraction and standard deviation scaling per channel.
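That preprocessing can be sketched in a few lines of NumPy. The mean/std values below are the widely used ImageNet statistics; treat them as an assumption, since each model ships its own:

```python
import numpy as np

# Scale 0-255 RGB to 0.0-1.0, then normalize per channel
# (mean subtraction and standard deviation scaling).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def preprocess(image_u8: np.ndarray) -> np.ndarray:
    """image_u8: (H, W, 3) uint8 array -> normalized float array."""
    x = image_u8.astype(np.float32) / 255.0       # to 0.0-1.0
    return (x - IMAGENET_MEAN) / IMAGENET_STD     # per-channel normalize

img = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)
out = preprocess(img)
print(out.shape)  # (224, 224, 3)
```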
Key insight: Raw pixels are like raw audio waveforms — they contain all the information but in a form that’s hard to reason about. A model looking at pixels sees numbers, not objects. We need to extract meaningful features — edges, textures, shapes, objects — from this sea of numbers.
CNNs: Learning to See
Convolutional Neural Networks extract visual features hierarchically
How Convolutions Work
A CNN slides small filters (kernels) — typically 3×3 or 5×5 — across the image. Each filter detects a specific pattern. Stacking layers creates a hierarchy:

Layer 1: Edges, gradients, colors
Layer 2: Textures, corners, simple shapes
Layer 3: Parts (eyes, wheels, windows)
Layer 4+: Objects, scenes, abstract concepts

Each layer builds on the previous, creating increasingly abstract representations.
Pooling & Feature Maps
Pooling reduces spatial dimensions (e.g., max pooling takes the maximum value in each 2×2 region). This creates a compression hierarchy: a 224×224 image becomes 112×112, then 56×56, then 28×28 — each level capturing more abstract features at progressively lower spatial resolution.
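The two operations above can be sketched directly in NumPy — a naive "valid" convolution (no padding) with a hand-built edge filter, followed by 2×2 max pooling. This is illustrative only; real CNNs use learned filters and optimized kernels:

```python
import numpy as np

# Naive 2D convolution ("valid" mode: no padding, output shrinks by kernel-1).
def conv2d_valid(img: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * kernel)
    return out

# 2x2 max pooling: keep the maximum of each non-overlapping 2x2 region.
def max_pool2x2(x: np.ndarray) -> np.ndarray:
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    return x[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])  # vertical-edge detector
img = np.random.rand(224, 224)          # grayscale for simplicity
feat = conv2d_valid(img, sobel_x)       # (222, 222) edge-response map
pooled = max_pool2x2(feat)              # (111, 111) after pooling
print(feat.shape, pooled.shape)
```

A CNN stacks many such learned filters per layer, so each layer outputs a whole bank of feature maps rather than one.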
Key CNN Architectures
// CNN evolution (layers / year / impact)
AlexNet       8 layers    2012   Won ImageNet, started the DL revolution
VGG           19 layers   2014   Showed depth matters
GoogLeNet     22 layers   2014   Inception modules, efficiency
ResNet        152 layers  2015   Skip connections, solved vanishing gradients
EfficientNet  varies      2019   Optimal depth/width/resolution scaling
ConvNeXt      varies      2022   CNN modernized with Transformer tricks
Key insight: CNNs were the dominant vision architecture for a decade (2012–2022). They’re still used as vision encoders in many multimodal models, but Vision Transformers are increasingly replacing them. Understanding CNNs is essential because many concepts (feature hierarchies, pooling, skip connections) carry over to modern architectures.
Feature Hierarchies
What the model actually “sees” at each level
The Feature Pyramid
Both CNNs and Vision Transformers learn hierarchical representations that mirror how the human visual cortex processes information:

Low-level (early layers): Edges, color gradients, simple textures
Mid-level (middle layers): Parts, shapes, patterns — eyes, wheels, bricks
High-level (final layers): Objects, scenes, concepts — “dog”, “beach”, “celebration”

This hierarchy emerges naturally from training — nobody explicitly teaches the model to detect edges first.
Attention Maps in ViTs
In Vision Transformers, attention maps reveal which patches the model focuses on. For a photo of a dog in a park:

Early layers: Attend to edges and textures uniformly
Middle layers: Focus on the dog’s face, body outline
Final layers: Attend to the dog as a whole object, with background suppressed

This shows the model has learned to “see” meaningful objects, not just pixel patterns.
Key insight: The features learned by vision models are remarkably similar to what neuroscientists observe in the human visual cortex (V1 → V2 → V4 → IT). Both learn edges first, then textures, then objects, then scenes. This convergence suggests these hierarchies are fundamental to visual understanding.
Vision Transformers (ViT)
Treating images as sequences of patches
The ViT Innovation
Instead of convolutions, ViT splits an image into fixed-size patches (typically 14×14 or 16×16 pixels) and treats each patch as a token. These patch tokens are linearly projected into embeddings, a [CLS] token is prepended, positional embeddings are added, and the whole sequence is processed by a standard Transformer encoder with self-attention.
Patch Math
// How images become token sequences
224×224 image with 16×16 patches:
  224 ÷ 16 = 14 patches per side
  14 × 14 = 196 patch tokens
1024×1024 image with 14×14 patches:
  1024 ÷ 14 ≈ 73 patches per side
  73 × 73 ≈ 5,329 patch tokens
// Each 14×14 patch holds 14×14×3 = 588 pixel values,
// projected to a d-dimensional embedding (e.g., 768)
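Patch extraction is just a reshape. A minimal sketch for the 224×224 / 16×16 case (the `patchify` helper is illustrative; libraries implement this as a strided convolution):

```python
import numpy as np

# Split a (H, W, 3) image into non-overlapping P x P patches,
# flattening each patch into a single vector ("token").
def patchify(img: np.ndarray, patch: int) -> np.ndarray:
    h, w, c = img.shape
    assert h % patch == 0 and w % patch == 0
    n = h // patch
    x = img.reshape(n, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)             # (n, n, P, P, C) patch grid
    return x.reshape(-1, patch * patch * c)    # (num_patches, P*P*C)

img = np.random.rand(224, 224, 3)
tokens = patchify(img, 16)
print(tokens.shape)  # (196, 768): 14x14 patches, each 16*16*3 values
```

In a real ViT, each of these 768-value rows is then multiplied by a learned projection matrix to produce the patch embedding.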
Why ViT Won
Global attention: Every patch attends to every other patch from layer 1 (CNNs only see local 3×3 neighborhoods)
Scalability: Performance improves smoothly with more data and compute — follows scaling laws
Unification: Same Transformer architecture for vision and language — enables multimodal models
Transfer learning: Pre-trained ViTs transfer well to downstream tasks with minimal fine-tuning
Key insight: ViT showed that Transformers don’t need convolutions to understand images. The self-attention mechanism naturally learns spatial relationships between patches. A patch of sky “knows” it’s above a patch of ground because attention learns these relationships from data.
From Patches to LLM Tokens
How images enter language models
The Projection Step
Raw patch embeddings from ViT live in vision space. LLMs operate in language space. A projector (typically a linear layer or small MLP) maps vision embeddings into the LLM’s token embedding space. This is the bridge between seeing and understanding — after projection, visual tokens are indistinguishable from text tokens to the LLM.
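A projector can be sketched as a small two-layer MLP. All dimensions and weights below are illustrative assumptions (768-d vision features, a 4096-d LLM embedding space, random untrained weights), chosen only to show the shape transformation:

```python
import numpy as np

# Sketch of a vision-to-language projector: ViT features (d_vision)
# -> hidden nonlinearity -> LLM token embedding space (d_llm).
rng = np.random.default_rng(0)
d_vision, d_llm = 768, 4096                 # illustrative dimensions
W1 = rng.normal(0, 0.02, (d_vision, d_llm)).astype(np.float32)
W2 = rng.normal(0, 0.02, (d_llm, d_llm)).astype(np.float32)

def project(patch_embeddings: np.ndarray) -> np.ndarray:
    """(num_patches, d_vision) -> (num_patches, d_llm) visual 'tokens'."""
    h = np.maximum(patch_embeddings @ W1, 0.0)  # linear + ReLU (GELU in practice)
    return h @ W2

patch_feats = rng.normal(size=(196, d_vision)).astype(np.float32)
visual_tokens = project(patch_feats)
print(visual_tokens.shape)  # (196, 4096)
```

After this step the visual tokens have the same shape as text token embeddings, so the LLM can process one concatenated sequence.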
Token Compression
5,329 tokens per image would consume a large share of an LLM’s context window. Modern VLMs use compression techniques:

Spatial pooling: Average adjacent patch tokens (2×2 pooling → 4× reduction)
Perceiver/Q-Former: Learn a fixed number of query tokens that attend to all patches
Dynamic resolution: Adjust token count based on image complexity and task
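The simplest of these, spatial pooling, can be sketched in a few lines (the grid size of 24×24 = 576 tokens is an illustrative assumption):

```python
import numpy as np

# Average each 2x2 block of neighboring patch tokens,
# cutting the token count by 4x while keeping the embedding dim.
def pool_tokens_2x2(tokens: np.ndarray, grid: int) -> np.ndarray:
    """tokens: (grid*grid, d) in row-major patch order -> (grid²/4, d)."""
    d = tokens.shape[1]
    x = tokens.reshape(grid, grid, d)                       # back to 2D grid
    x = x.reshape(grid // 2, 2, grid // 2, 2, d).mean(axis=(1, 3))
    return x.reshape(-1, d)

tokens = np.random.rand(24 * 24, 768)           # 576 patch tokens
pooled = pool_tokens_2x2(tokens, grid=24)
print(tokens.shape[0], "->", pooled.shape[0])   # 576 -> 144
```

Perceiver/Q-Former compression replaces this fixed averaging with learned cross-attention, which is why it preserves detail better at the same token budget.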
Token Budget by Model
// How many tokens per image?
GPT-4V   ~765 tokens (low res) / ~2,048 tokens (high res, tiled)
Gemini   ~258 tokens (efficient encoding)
Claude   ~1,600 tokens (per image)
LLaVA    ~576 tokens (open-source)
// With a 128K context window:
// ~60-80 images plus text can fit
// Cost: ~$0.002-0.01 per image (API)
Key insight: The number of tokens per image directly affects cost, latency, and how many images fit in context. More tokens = better detail but higher cost. This is a key design tradeoff when building multimodal applications — you’re always balancing visual fidelity against token budget.
CNN vs ViT: When to Use Which
Choosing the right vision backbone
CNN Advantages
Data efficiency: Works well with smaller datasets (thousands, not millions)
Inductive bias: Built-in understanding of spatial locality and translation invariance
Speed: Faster inference for small images on edge devices
Mobile: Efficient architectures like MobileNet, EfficientNet for phones and IoT
ViT Advantages
Scalability: Performance improves predictably with more data and compute
Global context: Every patch sees every other patch from layer 1
Flexibility: Handles variable resolution and aspect ratios gracefully
Unification: Same architecture as the LLM backbone — natural fit for multimodal
In Practice
Most modern multimodal models use ViT as the vision encoder because it integrates naturally with the Transformer-based LLM. But CNNs still appear in:

Mobile/edge: EfficientNet, MobileNet for on-device inference
Real-time detection: YOLO family for object detection at 30+ FPS
Hybrid architectures: ConvNeXt combines CNN efficiency with Transformer design principles
Medical imaging: Where labeled data is scarce and CNN inductive bias helps
Pro tip: When building multimodal applications, you rarely choose the vision encoder directly — it comes bundled with the VLM you select (GPT-4V, Gemini, LLaVA). But understanding the tradeoffs helps you choose the right VLM for your use case and debug when the model “doesn’t see” something.
The Vision Encoder Pipeline
End-to-end: from raw image to LLM understanding
Complete Pipeline
// Image → Understanding pipeline
1. Input              1024×1024 RGB image (3.1M values)
2. Preprocessing      Resize, normalize (mean/std), pad
3. Patch Extraction   Split into 14×14 patches (~5,329 patches)
4. Vision Encoder     Linear projection → + positional embeddings
   (ViT)              → Transformer layers → feature vectors
5. Compression        5,329 tokens → ~765 tokens (pooling/Q-Former)
6. Projection         Vision space → LLM token embedding space
7. LLM Processing     [visual tokens] + [text tokens] → understanding
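The whole pipeline can be traced as a shape walkthrough. Everything here is a stand-in: a 224×224 image with 16×16 patches for clean arithmetic, random matrices in place of the trained ViT and projector, and 2×2 mean pooling in place of a Q-Former:

```python
import numpy as np

rng = np.random.default_rng(0)
P, d_vit, d_llm = 16, 768, 4096                 # illustrative dimensions

# 1-2. Input + preprocessing (already resized/normalized here)
img = rng.random((224, 224, 3))
n = 224 // P                                    # 14 patches per side

# 3. Patch extraction: (14, 14) grid of 16x16x3 patches, flattened
patches = (img.reshape(n, P, n, P, 3)
              .transpose(0, 2, 1, 3, 4)
              .reshape(-1, P * P * 3))          # (196, 768)

# 4. "Vision encoder": linear projection standing in for the full ViT
embed = patches @ rng.normal(0, 0.02, (P * P * 3, d_vit))     # (196, 768)

# 5. Compression: 2x2 mean pooling over the patch grid, 196 -> 49
x = embed.reshape(n, n, d_vit)
compressed = (x.reshape(n // 2, 2, n // 2, 2, d_vit)
               .mean(axis=(1, 3))
               .reshape(-1, d_vit))             # (49, 768)

# 6. Projection into the LLM's token embedding space
visual_tokens = compressed @ rng.normal(0, 0.02, (d_vit, d_llm))  # (49, 4096)

# 7. These 49 rows are concatenated with text token embeddings
print(patches.shape, embed.shape, compressed.shape, visual_tokens.shape)
```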
Why This Matters
Every multimodal model you use — GPT-4V, Gemini, Claude, LLaVA — runs this pipeline (or a variant of it) under the hood. Understanding it helps you:

Choose the right model for your image complexity
Optimize token usage and control API costs
Debug failures when the model “doesn’t see” something (often a resolution or compression issue)
Understand tradeoffs between resolution, detail, and latency
Key insight: The bottleneck in multimodal understanding is usually the compression step. Going from 5,329 patches to 765 tokens means information is lost. When a model fails to read small text in an image, it’s often because the compression discarded those fine details.
Key Takeaways
What to remember about machine vision
The Essential Concepts
1. Pixels are raw data — too many dimensions for direct processing (3.1M values for a 1024×1024 image)

2. CNNs extract hierarchical features using local filters: edges → textures → parts → objects

3. ViTs split images into patches (14×14 or 16×16 pixels) and process them as token sequences with self-attention

4. Projectors bridge vision and language — mapping visual embeddings into the LLM’s token space

5. Token compression is the key tradeoff: more tokens = better detail but higher cost and latency
Practical Implications
• When a VLM can’t read small text in an image, try higher resolution mode (more tokens)
• When costs are too high, use low-res mode for images that don’t need fine detail
• Crop and zoom into the relevant region before sending to the model
• The same image costs different token amounts across models — factor this into model selection
Next up: Chapter 3 covers the generative model family tree — VAEs, GANs, Normalizing Flows, and Diffusion models — how each generates images, their strengths and weaknesses, and why diffusion became the dominant approach.