Ch 2 — How Machines See: From Pixels to Patches

Pixels, CNNs, Vision Transformers, image patching, and how images become tokens for LLMs
High Level

Pixels → CNN → Features → Patches → Tokens → LLM
Pixels: The Raw Material
What a computer actually “sees”
What Is a Pixel
A pixel is a single point of color, represented as three numbers (R, G, B) ranging from 0–255. A 1024×1024 image has 1,048,576 pixels × 3 channels = 3,145,728 values. This is far too many dimensions for a model to process directly — we need a way to compress images into meaningful representations.
The Resolution Problem
// Raw pixel counts by resolution (width × height × 3 channels)
224×224       150,528 values     (ImageNet standard)
512×512       786,432 values     (Stable Diffusion)
1024×1024     3,145,728 values   (high-res generation)
3840×2160     24,883,200 values  (4K image)
// Processing this directly would require
// enormous compute and memory
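The counts above follow from simple arithmetic. A minimal sketch (the `raw_values` helper is illustrative, not from any library):

```python
# Raw values a model would have to ingest, per resolution.
def raw_values(width: int, height: int, channels: int = 3) -> int:
    """Total numbers in a raw image: width * height * channels."""
    return width * height * channels

for w, h, label in [(224, 224, "ImageNet standard"),
                    (512, 512, "Stable Diffusion"),
                    (1024, 1024, "high-res generation"),
                    (3840, 2160, "4K image")]:
    print(f"{w}x{h}: {raw_values(w, h):>10,} values  ({label})")
```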
Color Spaces
RGB: Red, Green, Blue (most common, 3 channels)
Grayscale: Single channel, 0–255
HSV: Hue, Saturation, Value (useful for color-based tasks)
RGBA: RGB + Alpha transparency channel

Most vision models work with RGB normalized to 0.0–1.0, often with mean subtraction and standard deviation scaling per channel.
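That preprocessing can be sketched in a few lines of NumPy. The mean/std values below are the widely used ImageNet statistics; treat them as an assumption, since each model ships its own:

```python
import numpy as np

# Scale 0-255 RGB to 0.0-1.0, then normalize per channel
# (mean subtraction and standard deviation scaling).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def preprocess(image_u8: np.ndarray) -> np.ndarray:
    """image_u8: (H, W, 3) uint8 array -> normalized float array."""
    x = image_u8.astype(np.float32) / 255.0       # to 0.0-1.0
    return (x - IMAGENET_MEAN) / IMAGENET_STD     # per-channel normalize

img = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)
out = preprocess(img)
print(out.shape)  # (224, 224, 3)
```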
Key insight: Raw pixels are like raw audio waveforms — they contain all the information but in a form that’s hard to reason about. A model looking at pixels sees numbers, not objects. We need to extract meaningful features — edges, textures, shapes, objects — from this sea of numbers.
CNNs: Learning to See
Convolutional Neural Networks extract visual features hierarchically
How Convolutions Work
A CNN slides small filters (kernels) — typically 3×3 or 5×5 — across the image. Each filter detects a specific pattern. Stacking layers creates a hierarchy:

Layer 1: Edges, gradients, colors
Layer 2: Textures, corners, simple shapes
Layer 3: Parts (eyes, wheels, windows)
Layer 4+: Objects, scenes, abstract concepts

Each layer builds on the previous, creating increasingly abstract representations.
Pooling & Feature Maps
Pooling reduces spatial dimensions (e.g., max pooling takes the maximum value in each 2×2 region). This creates a compression hierarchy: a 224×224 image becomes 112×112, then 56×56, then 28×28 — each level capturing more abstract features at progressively lower spatial resolution.
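The two operations above can be sketched directly in NumPy — a naive "valid" convolution (no padding) with a hand-built edge filter, followed by 2×2 max pooling. This is illustrative only; real CNNs use learned filters and optimized kernels:

```python
import numpy as np

# Naive 2D convolution ("valid" mode: no padding, output shrinks by kernel-1).
def conv2d_valid(img: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * kernel)
    return out

# 2x2 max pooling: keep the maximum of each non-overlapping 2x2 region.
def max_pool2x2(x: np.ndarray) -> np.ndarray:
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    return x[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])  # vertical-edge detector
img = np.random.rand(224, 224)          # grayscale for simplicity
feat = conv2d_valid(img, sobel_x)       # (222, 222) edge-response map
pooled = max_pool2x2(feat)              # (111, 111) after pooling
print(feat.shape, pooled.shape)
```

A CNN stacks many such learned filters per layer, so each layer outputs a whole bank of feature maps rather than one.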
Key CNN Architectures
// CNN evolution (layers / year / impact)
AlexNet       8 layers    2012   Won ImageNet, started the DL revolution
VGG           19 layers   2014   Showed depth matters
GoogLeNet     22 layers   2014   Inception modules, efficiency
ResNet        152 layers  2015   Skip connections, solved vanishing gradients
EfficientNet  varies      2019   Optimal depth/width/resolution scaling
ConvNeXt      varies      2022   CNN modernized with Transformer tricks
Key insight: CNNs were the dominant vision architecture for a decade (2012–2022). They’re still used as vision encoders in many multimodal models, but Vision Transformers are increasingly replacing them. Understanding CNNs is essential because many concepts (feature hierarchies, pooling, skip connections) carry over to modern architectures.
Feature Hierarchies
What the model actually “sees” at each level
The Feature Pyramid
Both CNNs and Vision Transformers learn hierarchical representations that mirror how the human visual cortex processes information:

Low-level (early layers): Edges, color gradients, simple textures
Mid-level (middle layers): Parts, shapes, patterns — eyes, wheels, bricks
High-level (final layers): Objects, scenes, concepts — “dog”, “beach”, “celebration”

This hierarchy emerges naturally from training — nobody explicitly teaches the model to detect edges first.
Attention Maps in ViTs
In Vision Transformers, attention maps reveal which patches the model focuses on. For a photo of a dog in a park:

Early layers: Attend to edges and textures uniformly
Middle layers: Focus on the dog’s face, body outline
Final layers: Attend to the dog as a whole object, with background suppressed

This shows the model has learned to “see” meaningful objects, not just pixel patterns.
Key insight: The features learned by vision models are remarkably similar to what neuroscientists observe in the human visual cortex (V1 → V2 → V4 → IT). Both learn edges first, then textures, then objects, then scenes. This convergence suggests these hierarchies are fundamental to visual understanding.
Vision Transformers (ViT)
Treating images as sequences of patches
The ViT Innovation
Instead of convolutions, ViT splits an image into fixed-size patches (typically 14×14 or 16×16 pixels) and treats each patch as a token. These patch tokens are linearly projected into embeddings, a [CLS] token is prepended, positional embeddings are added, and the whole sequence is processed by a standard Transformer encoder with self-attention.
Patch Math
// How images become token sequences
224×224 image with 16×16 patches:
  224 ÷ 16 = 14 patches per side
  14 × 14 = 196 patch tokens
1024×1024 image with 14×14 patches:
  1024 ÷ 14 ≈ 73 patches per side
  73 × 73 ≈ 5,329 patch tokens
// Each 14×14 patch holds 14×14×3 = 588 pixel values,
// projected to a d-dimensional embedding (e.g., 768)
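Patch extraction is just a reshape. A minimal sketch for the 224×224 / 16×16 case (the `patchify` helper is illustrative; libraries implement this as a strided convolution):

```python
import numpy as np

# Split a (H, W, 3) image into non-overlapping P x P patches,
# flattening each patch into a single vector ("token").
def patchify(img: np.ndarray, patch: int) -> np.ndarray:
    h, w, c = img.shape
    assert h % patch == 0 and w % patch == 0
    n = h // patch
    x = img.reshape(n, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)             # (n, n, P, P, C) patch grid
    return x.reshape(-1, patch * patch * c)    # (num_patches, P*P*C)

img = np.random.rand(224, 224, 3)
tokens = patchify(img, 16)
print(tokens.shape)  # (196, 768): 14x14 patches, each 16*16*3 values
```

In a real ViT, each of these 768-value rows is then multiplied by a learned projection matrix to produce the patch embedding.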
Why ViT Won
Global attention: Every patch attends to every other patch from layer 1 (CNNs only see local 3×3 neighborhoods)
Scalability: Performance improves smoothly with more data and compute — follows scaling laws
Unification: Same Transformer architecture for vision and language — enables multimodal models
Transfer learning: Pre-trained ViTs transfer well to downstream tasks with minimal fine-tuning
Key insight: ViT showed that Transformers don’t need convolutions to understand images. The self-attention mechanism naturally learns spatial relationships between patches. A patch of sky “knows” it’s above a patch of ground because attention learns these relationships from data.
From Patches to LLM Tokens
How images enter language models
The Projection Step
Raw patch embeddings from ViT live in vision space. LLMs operate in language space. A projector (typically a linear layer or small MLP) maps vision embeddings into the LLM’s token embedding space. This is the bridge between seeing and understanding — after projection, visual tokens are indistinguishable from text tokens to the LLM.
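A projector can be sketched as a small two-layer MLP. All dimensions and weights below are illustrative assumptions (768-d vision features, a 4096-d LLM embedding space, random untrained weights), chosen only to show the shape transformation:

```python
import numpy as np

# Sketch of a vision-to-language projector: ViT features (d_vision)
# -> hidden nonlinearity -> LLM token embedding space (d_llm).
rng = np.random.default_rng(0)
d_vision, d_llm = 768, 4096                 # illustrative dimensions
W1 = rng.normal(0, 0.02, (d_vision, d_llm)).astype(np.float32)
W2 = rng.normal(0, 0.02, (d_llm, d_llm)).astype(np.float32)

def project(patch_embeddings: np.ndarray) -> np.ndarray:
    """(num_patches, d_vision) -> (num_patches, d_llm) visual 'tokens'."""
    h = np.maximum(patch_embeddings @ W1, 0.0)  # linear + ReLU (GELU in practice)
    return h @ W2

patch_feats = rng.normal(size=(196, d_vision)).astype(np.float32)
visual_tokens = project(patch_feats)
print(visual_tokens.shape)  # (196, 4096)
```

After this step the visual tokens have the same shape as text token embeddings, so the LLM can process one concatenated sequence.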
Token Compression
5,329 tokens per image would consume a large share of an LLM’s context window. Modern VLMs use compression techniques:

Spatial pooling: Average adjacent patch tokens (2×2 pooling → 4× reduction)
Perceiver/Q-Former: Learn a fixed number of query tokens that attend to all patches
Dynamic resolution: Adjust token count based on image complexity and task
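The simplest of these, spatial pooling, can be sketched in a few lines (the grid size of 24×24 = 576 tokens is an illustrative assumption):

```python
import numpy as np

# Average each 2x2 block of neighboring patch tokens,
# cutting the token count by 4x while keeping the embedding dim.
def pool_tokens_2x2(tokens: np.ndarray, grid: int) -> np.ndarray:
    """tokens: (grid*grid, d) in row-major patch order -> (grid²/4, d)."""
    d = tokens.shape[1]
    x = tokens.reshape(grid, grid, d)                       # back to 2D grid
    x = x.reshape(grid // 2, 2, grid // 2, 2, d).mean(axis=(1, 3))
    return x.reshape(-1, d)

tokens = np.random.rand(24 * 24, 768)           # 576 patch tokens
pooled = pool_tokens_2x2(tokens, grid=24)
print(tokens.shape[0], "->", pooled.shape[0])   # 576 -> 144
```

Perceiver/Q-Former compression replaces this fixed averaging with learned cross-attention, which is why it preserves detail better at the same token budget.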
Token Budget by Model
// How many tokens per image?
GPT-4V   ~765 tokens (low res) / ~2,048 tokens (high res, tiled)
Gemini   ~258 tokens (efficient encoding)
Claude   ~1,600 tokens (per image)
LLaVA    ~576 tokens (open-source)
// With a 128K context window:
// ~60-80 images plus text can fit
// Cost: ~$0.002-0.01 per image (API)
Key insight: The number of tokens per image directly affects cost, latency, and how many images fit in context. More tokens = better detail but higher cost. This is a key design tradeoff when building multimodal applications — you’re always balancing visual fidelity against token budget.
CNN vs ViT: When to Use Which
Choosing the right vision backbone
CNN Advantages
Data efficiency: Works well with smaller datasets (thousands, not millions)
Inductive bias: Built-in understanding of spatial locality and translation invariance
Speed: Faster inference for small images on edge devices
Mobile: Efficient architectures like MobileNet, EfficientNet for phones and IoT
ViT Advantages
Scalability: Performance improves predictably with more data and compute
Global context: Every patch sees every other patch from layer 1
Flexibility: Handles variable resolution and aspect ratios gracefully
Unification: Same architecture as the LLM backbone — natural fit for multimodal
In Practice
Most modern multimodal models use ViT as the vision encoder because it integrates naturally with the Transformer-based LLM. But CNNs still appear in:

Mobile/edge: EfficientNet, MobileNet for on-device inference
Real-time detection: YOLO family for object detection at 30+ FPS
Hybrid architectures: ConvNeXt combines CNN efficiency with Transformer design principles
Medical imaging: Where labeled data is scarce and CNN inductive bias helps
Pro tip: When building multimodal applications, you rarely choose the vision encoder directly — it comes bundled with the VLM you select (GPT-4V, Gemini, LLaVA). But understanding the tradeoffs helps you choose the right VLM for your use case and debug when the model “doesn’t see” something.
The Vision Encoder Pipeline
End-to-end: from raw image to LLM understanding
Complete Pipeline
// Image → Understanding pipeline
1. Input              1024×1024 RGB image (3.1M values)
2. Preprocessing      Resize, normalize (mean/std), pad
3. Patch Extraction   Split into 14×14 patches (~5,329 patches)
4. Vision Encoder     Linear projection → + positional embeddings
   (ViT)              → Transformer layers → feature vectors
5. Compression        5,329 tokens → ~765 tokens (pooling/Q-Former)
6. Projection         Vision space → LLM token embedding space
7. LLM Processing     [visual tokens] + [text tokens] → understanding
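The whole pipeline can be traced as a shape walkthrough. Everything here is a stand-in: a 224×224 image with 16×16 patches for clean arithmetic, random matrices in place of the trained ViT and projector, and 2×2 mean pooling in place of a Q-Former:

```python
import numpy as np

rng = np.random.default_rng(0)
P, d_vit, d_llm = 16, 768, 4096                 # illustrative dimensions

# 1-2. Input + preprocessing (already resized/normalized here)
img = rng.random((224, 224, 3))
n = 224 // P                                    # 14 patches per side

# 3. Patch extraction: (14, 14) grid of 16x16x3 patches, flattened
patches = (img.reshape(n, P, n, P, 3)
              .transpose(0, 2, 1, 3, 4)
              .reshape(-1, P * P * 3))          # (196, 768)

# 4. "Vision encoder": linear projection standing in for the full ViT
embed = patches @ rng.normal(0, 0.02, (P * P * 3, d_vit))     # (196, 768)

# 5. Compression: 2x2 mean pooling over the patch grid, 196 -> 49
x = embed.reshape(n, n, d_vit)
compressed = (x.reshape(n // 2, 2, n // 2, 2, d_vit)
               .mean(axis=(1, 3))
               .reshape(-1, d_vit))             # (49, 768)

# 6. Projection into the LLM's token embedding space
visual_tokens = compressed @ rng.normal(0, 0.02, (d_vit, d_llm))  # (49, 4096)

# 7. These 49 rows are concatenated with text token embeddings
print(patches.shape, embed.shape, compressed.shape, visual_tokens.shape)
```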
Why This Matters
Every multimodal model you use — GPT-4V, Gemini, Claude, LLaVA — runs this pipeline (or a variant of it) under the hood. Understanding it helps you:

Choose the right model for your image complexity
Optimize token usage and control API costs
Debug failures when the model “doesn’t see” something (often a resolution or compression issue)
Understand tradeoffs between resolution, detail, and latency
Key insight: The bottleneck in multimodal understanding is usually the compression step. Going from 5,329 patches to 765 tokens means information is lost. When a model fails to read small text in an image, it’s often because the compression discarded those fine details.
Key Takeaways
What to remember about machine vision
The Essential Concepts
1. Pixels are raw data — too many dimensions for direct processing (3.1M values for a 1024×1024 image)

2. CNNs extract hierarchical features using local filters: edges → textures → parts → objects

3. ViTs split images into patches (14×14 or 16×16 pixels) and process them as token sequences with self-attention

4. Projectors bridge vision and language — mapping visual embeddings into the LLM’s token space

5. Token compression is the key tradeoff: more tokens = better detail but higher cost and latency
Practical Implications
• When a VLM can’t read small text in an image, try higher resolution mode (more tokens)
• When costs are too high, use low-res mode for images that don’t need fine detail
• Crop and zoom into the relevant region before sending to the model
• The same image costs different token amounts across models — factor this into model selection
Next up: Chapter 3 covers the generative model family tree — VAEs, GANs, Normalizing Flows, and Diffusion models — how each generates images, their strengths and weaknesses, and why diffusion became the dominant approach.