The Essential Concepts
2. Pixels are raw data — too many dimensions for direct processing (a 1024×1024 RGB image is 3 × 1024 × 1024 ≈ 3.1M values)
2. CNNs extract hierarchical features using local filters: edges → textures → parts → objects
3. ViTs split images into patches (14×14 or 16×16 pixels) and process them as token sequences with self-attention
4. Projectors bridge vision and language — mapping visual embeddings into the LLM’s token space
5. Token compression is the key tradeoff: more tokens = better detail but higher cost and latency
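Point 2's "local filters" can be made concrete with a toy example. The sketch below slides a single 3×3 vertical-edge kernel (a Sobel-style filter, chosen here for illustration; real CNNs learn their kernels) over a tiny synthetic image. It shows the first rung of the edge → texture → part → object hierarchy: the filter fires only where pixel values change.

```python
import numpy as np

# Sobel-style kernel that responds to vertical edges (illustrative choice).
kernel = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]], dtype=float)

# Toy grayscale image: dark left half, bright right half -> one vertical edge.
image = np.zeros((8, 8))
image[:, 4:] = 1.0

# "Valid" convolution (technically cross-correlation, as in DL frameworks):
# slide the 3x3 kernel over every position and take the weighted sum.
out = np.zeros((6, 6))
for i in range(6):
    for j in range(6):
        out[i, j] = np.sum(image[i:i+3, j:j+3] * kernel)

print(out[0])  # -> [0. 0. 4. 4. 0. 0.] -- nonzero only at the edge columns
```

Stacking many such filters, with nonlinearities in between, is what lets later layers combine edge responses into textures and object parts.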
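Points 3–5 can be sketched in a few lines of NumPy. The example below splits a 224×224 RGB image into non-overlapping 16×16 patches (ViT-B/16-style), flattens each patch into a vector, and then applies a single linear "projector" into a hypothetical LLM hidden size. The dimensions (224, 16, 4096) and the random projection matrix are illustrative assumptions, not any specific model's values.

```python
import numpy as np

H = W = 224   # image side (assumed, as in standard ViT inputs)
P = 16        # patch size, as in ViT-B/16
C = 3         # RGB channels
d_llm = 4096  # hypothetical LLM hidden dimension

rng = np.random.default_rng(0)
image = rng.random((H, W, C))

# Split into non-overlapping P x P patches and flatten each to one vector.
n = H // P  # 14 patches per side
patches = (image.reshape(n, P, n, P, C)
                .transpose(0, 2, 1, 3, 4)
                .reshape(n * n, P * P * C))
print(patches.shape)  # -> (196, 768): 196 tokens, each 16*16*3 = 768 values

# A linear projector maps each visual token into the LLM's embedding space.
# (Real projectors are trained; this random matrix just shows the shapes.)
W_proj = rng.normal(0, 0.01, size=(P * P * C, d_llm))
llm_tokens = patches @ W_proj
print(llm_tokens.shape)  # -> (196, 4096): ready to interleave with text tokens
```

This also makes the token-compression tradeoff from point 5 tangible: doubling the image side to 448 at the same patch size quadruples the sequence to 784 tokens.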
Practical Implications
• When a VLM can’t read small text in an image, try higher resolution mode (more tokens)
• When costs are too high, use low-res mode for images that don’t need fine detail
• Crop and zoom into the relevant region before sending to the model
• The same image costs different token amounts across models — factor this into model selection
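The last point can be sketched with two hypothetical pricing schemes: one that resizes every image to a fixed grid (flat cost), and one that tiles the image and charges per tile, so cost grows with resolution. All numbers here (576 tokens flat; 85 base + 170 per 512px tile) are made-up placeholders for illustration, loosely modeled on published tiling schemes — check your provider's docs for real values.

```python
import math

def tokens_fixed(width: int, height: int, tokens_per_image: int = 576) -> int:
    """Scheme A (hypothetical): every image is resized to one grid -> flat cost."""
    return tokens_per_image

def tokens_tiled(width: int, height: int, tile: int = 512,
                 tokens_per_tile: int = 170, base: int = 85) -> int:
    """Scheme B (hypothetical): cost scales with the number of 512px tiles."""
    cols = math.ceil(width / tile)
    rows = math.ceil(height / tile)
    return base + cols * rows * tokens_per_tile

for w, h in [(512, 512), (1024, 1024), (2048, 1024)]:
    print(f"{w}x{h}: fixed={tokens_fixed(w, h)}, tiled={tokens_tiled(w, h)}")
# -> 512x512: fixed=576, tiled=255
#    1024x1024: fixed=576, tiled=765
#    2048x1024: fixed=576, tiled=1445
```

Under scheme B, cropping a 2048×1024 screenshot down to the relevant 512×512 region cuts the image cost by more than 5×, which is exactly why the crop-and-zoom tip above pays off.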
Next up: Chapter 3 covers the generative model family tree — VAEs, GANs, Normalizing Flows, and Diffusion models — how each generates images, their strengths and weaknesses, and why diffusion became the dominant approach.