Essential Concepts
1. Audio tokenization (EnCodec): Compresses audio by up to roughly 300x into discrete tokens, so a Transformer can process sound as a token sequence
2. Whisper: General-purpose STT trained on 680K hours of audio, covering 99 languages, with accuracy approaching human transcribers on English
3. Modern TTS: Often hard to distinguish from human voices; voice cloning from about 30 seconds of audio
4. Music generation: Full songs from text prompts (Suno, Udio) — legal implications still unresolved
5. Native audio models: GPT-4o processes audio directly, without a separate STT/TTS pipeline, and responds in roughly 300 ms (OpenAI reports as little as 232 ms, about 320 ms on average)
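
To make the compression figure in point 1 concrete, here is a minimal arithmetic sketch. It assumes the publicly documented settings of the 24 kHz EnCodec model (75 token frames per second, 1,024-entry codebooks) and 16-bit mono PCM as the uncompressed baseline; the function names are illustrative and not part of any library.

```python
import math

def token_bitrate(frame_rate_hz, n_codebooks, codebook_size):
    """Bits per second needed to store the discrete codes."""
    bits_per_code = math.log2(codebook_size)  # 1024 entries -> 10 bits
    return frame_rate_hz * n_codebooks * bits_per_code

def compression_ratio(sample_rate_hz, bit_depth, token_bps, channels=1):
    """Raw PCM bitrate divided by the token bitrate."""
    raw_bps = sample_rate_hz * bit_depth * channels
    return raw_bps / token_bps

# Assumed settings: 24 kHz EnCodec model, 16-bit mono PCM baseline.
SAMPLE_RATE = 24_000
BIT_DEPTH = 16
FRAME_RATE = 75
CODEBOOK_SIZE = 1024

# Each token frame summarizes this many raw samples.
print("samples per frame:", SAMPLE_RATE // FRAME_RATE)

# More codebooks per frame means higher quality and less compression.
for n_codebooks in (2, 8, 32):
    bps = token_bitrate(FRAME_RATE, n_codebooks, CODEBOOK_SIZE)
    ratio = compression_ratio(SAMPLE_RATE, BIT_DEPTH, bps)
    print(f"{n_codebooks:>2} codebooks: {bps / 1000:.1f} kbps, {ratio:.0f}x smaller")
```

Under these assumptions, each frame covers 320 samples, and the lowest-bitrate setting (2 codebooks, 1.5 kbps) is 256x smaller than raw PCM. The "roughly 300x" figure is therefore an upper bound: higher-quality settings use more codebooks per frame and compress less.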
The Audio Modality Advantage
Audio AI is the most mature non-text modality:
• STT: Production-ready, deployed at massive scale
• TTS: Crossed the uncanny valley, commercially available
• Voice agents: Already handling millions of customer calls
• Music: Generating radio-quality songs in seconds
Audio is where vision was 2 years ago — the technology works, and the ecosystem is rapidly maturing.
Next up: Chapter 9 shifts from generation to understanding — how Vision-Language Models (VLMs) work, the architecture behind GPT-4V, Gemini, and LLaVA, and how models learn to reason about images.