What’s Next
The trajectory is clear: everything becomes tokens. Text, images, audio, video, 3D objects, actions, sensor data — all tokenized and processed by the same transformer architecture. Future models will seamlessly switch between modalities: see a diagram, explain it verbally, generate an improved version, and write code to implement it. The transformer’s modality-agnostic attention mechanism makes this possible.
Key insight: the unifying principle across all 12 chapters so far is that everything is a sequence of vectors processed by attention. Text tokens, image patches, audio frames, video frames — they all become vectors in the same high-dimensional space. The transformer doesn’t care what the vectors represent. This architectural universality is why the same basic design (Ch 4) powers text, vision, audio, code, and multimodal AI.
The Unifying Principle
# Everything is tokens:
#   Text:  "Hello"      → [9906]      → embed  → vector
#   Image: 16×16 patch  → linear      → vector
#   Audio: 20 ms frame  → codec       → vector
#   Video: frame        → ViT         → vectors
#   Code:  "def f"      → [755, 282]  → vectors
#
# All fed into the same transformer:
#   [text, image, audio, text, image, ...]
#   Attention connects everything; the model
#   learns cross-modal relationships.

Future capabilities:
- Real-time video understanding
- Robotic action generation
- 3D scene understanding
- Scientific data (proteins, molecules)
- Any modality that can be tokenized
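The pipeline above can be sketched in a few lines of NumPy. This is a minimal illustration, not a real model: the embedding table, projection matrices, dimensions (`d_model = 64`), and random inputs are all placeholder assumptions. The point is only that three modalities reduce to rows of the same `(seq_len, d_model)` matrix, which is all a transformer ever sees.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64  # shared embedding width (illustrative choice)

# Text: a token ID indexes into an embedding table.
vocab = rng.normal(size=(50_000, d_model))
text_vecs = vocab[[9906]]                     # "Hello" -> shape (1, d_model)

# Image: a flattened 16x16 RGB patch goes through a linear projection.
patch = rng.normal(size=(16 * 16 * 3,))
W_img = rng.normal(size=(16 * 16 * 3, d_model))
image_vecs = (patch @ W_img)[None, :]         # shape (1, d_model)

# Audio: a 20 ms frame (320 samples at 16 kHz) is projected the same way.
frame = rng.normal(size=(320,))
W_aud = rng.normal(size=(320, d_model))
audio_vecs = (frame @ W_aud)[None, :]         # shape (1, d_model)

# One interleaved sequence: modality no longer matters downstream.
sequence = np.concatenate([text_vecs, image_vecs, audio_vecs], axis=0)
print(sequence.shape)  # (3, 64)
```

Once everything is a row in `sequence`, attention can relate any position to any other, regardless of which modality produced it.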