Ch 8 — Speech & Audio AI

Whisper, ElevenLabs, music generation, audio tokenization, and real-time voice agents
High Level
Input → Encode → STT → TTS → Music → Agents
Audio Fundamentals
How sound becomes data that AI can process
From Waveform to Spectrogram
Raw audio is a 1D waveform — amplitude values sampled 16,000–44,100 times per second. One second of CD-quality audio = 44,100 values. Far too many for direct processing.

The solution: convert to a mel spectrogram — a 2D image-like representation showing frequency (y-axis) vs time (x-axis) with intensity as color. This lets us apply the same techniques used for images (CNNs, ViTs, patches) to audio.
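The waveform-to-spectrogram step can be sketched with NumPy alone. The snippet below computes a plain STFT magnitude spectrogram; a real front end (e.g. librosa's, or Whisper's) would additionally pool the frequency bins through a mel filterbank, which is omitted here. Parameter values are illustrative.

```python
import numpy as np

def spectrogram(wave, n_fft=400, hop=160):
    """Short-time Fourier transform magnitudes: (frames, freq bins).

    For a 16 kHz signal, n_fft=400 / hop=160 means 25 ms windows
    every 10 ms -- the framing Whisper-style front ends use.
    """
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(wave) - n_fft + 1, hop):
        frame = wave[start:start + n_fft] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames)

# One second of a 440 Hz tone sampled at 16 kHz:
sr = 16_000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # → (98, 201): 98 time frames × 201 frequency bins
```

The energy concentrates in the bin nearest 440 Hz (bin 11 at 40 Hz resolution) — exactly the kind of 2D time-frequency structure CNNs and ViTs can consume.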
Audio Tokenization
// How audio becomes tokens
Raw waveform           44,100 samples/sec × 30 sec = 1.3M values
Mel spectrogram        80 mel bins × 3,000 time frames = 240K values
EnCodec tokens (Meta)  ~75 tokens/sec × 30 sec = 2,250 tokens
                       Discrete tokens, like text!
// EnCodec compresses audio 300x
// while preserving speech quality
Key insight: Audio tokenization (EnCodec, SoundStream) is the audio equivalent of image patching. It converts continuous audio into discrete tokens that can be processed by Transformers — enabling the same “everything is tokens” unification we saw for images.
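The compression figure is easy to sanity-check. A back-of-envelope sketch, assuming EnCodec's 24 kHz mono setting and counting one token frame per step (higher bitrates stack several codebook tokens per frame):

```python
# EnCodec-style compression, back of the envelope.
sample_rate = 24_000  # samples/sec of the 24 kHz mono EnCodec model
token_rate = 75       # token frames emitted per second
seconds = 30

samples = sample_rate * seconds  # 720,000 raw amplitude values
tokens = token_rate * seconds    # 2,250 discrete tokens
print(samples // tokens)         # → 320: roughly the "300x" above
```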
Whisper: Universal Speech Recognition
OpenAI’s multilingual STT that changed the game
How Whisper Works
Whisper is an encoder-decoder Transformer trained on 680,000 hours of multilingual audio:

1. Input: 30-second audio chunks converted to mel spectrograms
2. Encoder: Processes the spectrogram with Transformer layers
3. Decoder: Autoregressively generates text tokens
4. Output: Transcribed text with timestamps

Supports 99 languages, translation, and timestamp-level alignment.
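Step 3 above — autoregressive decoding — reduces to a simple greedy loop. A minimal sketch where `next_token` and the `SCRIPT` table are stand-ins for a real Transformer forward pass; only the two special tokens are actual Whisper vocabulary:

```python
# Toy greedy decoding loop in the shape of Whisper's decoder:
# given encoded audio, emit tokens one at a time until end-of-text.
SOT, EOT = "<|startoftranscript|>", "<|endoftext|>"

# Fake "model": maps the tokens generated so far to the next token.
SCRIPT = {
    (SOT,): "hello",
    (SOT, "hello"): "world",
    (SOT, "hello", "world"): EOT,
}

def next_token(audio_features, tokens):
    # A real decoder attends to audio_features via cross-attention
    # and to its own previous tokens via self-attention.
    return SCRIPT[tuple(tokens)]

def transcribe(audio_features, max_len=32):
    tokens = [SOT]
    for _ in range(max_len):
        tok = next_token(audio_features, tokens)
        if tok == EOT:
            break
        tokens.append(tok)
    return " ".join(tokens[1:])

print(transcribe(audio_features=None))  # → "hello world"
```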
Model Sizes
// Whisper model variants
tiny      39M params   ~32x real-time
base      74M params   ~16x real-time
small     244M params  ~6x real-time
medium    769M params  ~2x real-time
large-v3  1.5B params  ~1x real-time
// "Real-time" = processes audio as fast
// as it plays. large-v3 is near-human
// accuracy across most languages.
Key insight: Whisper’s breakthrough was training on weakly supervised internet audio (680K hours) rather than carefully labeled data. This massive, noisy dataset gave it robustness to accents, background noise, and domain-specific vocabulary that previous models lacked.
Text-to-Speech & Voice Cloning
ElevenLabs, OpenAI TTS, and the voice synthesis revolution
Modern TTS Architecture
Modern TTS systems generate speech that is often indistinguishable from human voices:

1. Text analysis: Parse text, handle abbreviations, numbers, punctuation
2. Prosody prediction: Determine rhythm, stress, intonation
3. Audio generation: Produce waveform using neural vocoders or diffusion

The best systems (ElevenLabs, OpenAI TTS) produce natural-sounding speech with emotion, emphasis, and conversational flow.
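Step 1 — text analysis — is the least glamorous but most bug-prone stage. A toy normalization pass is sketched below; the abbreviation table and number coverage are purely illustrative, not any particular engine's rules.

```python
import re

# Toy TTS text normalization: expand abbreviations and small numbers
# so the synthesizer sees speakable words instead of symbols.
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street", "etc.": "et cetera"}
ONES = ["zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"]
TENS = {10: "ten", 20: "twenty", 30: "thirty", 40: "forty", 50: "fifty"}

def number_to_words(n: int) -> str:
    if n < 10:
        return ONES[n]
    if n in TENS:
        return TENS[n]
    if n < 60:
        return TENS[n // 10 * 10] + "-" + ONES[n % 10]
    return str(n)  # numbers this toy doesn't cover pass through

def normalize(text: str) -> str:
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    return re.sub(r"\d+", lambda m: number_to_words(int(m.group())), text)

print(normalize("Dr. Smith lives at 42 Main St."))
# → "Doctor Smith lives at forty-two Main Street"
```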
Voice Cloning
Voice cloning creates a synthetic voice from a short audio sample:

ElevenLabs: 30 seconds of reference audio for basic cloning
Professional cloning: 3–5 minutes for high-fidelity reproduction
Zero-shot: Some models clone from a single sentence
The TTS Landscape
// Major TTS providers (2025)
ElevenLabs        Best quality, voice cloning, 29 languages
                  $0.30/1K chars, ~200ms latency
OpenAI TTS        6 built-in voices, natural prosody
                  $0.015/1K chars, ~500ms latency
Google Cloud TTS  WaveNet voices, 40+ languages
                  $0.016/1K chars, enterprise features
Open-source       Coqui, Bark, XTTS
                  Free, customizable, self-hosted
                  Quality approaching commercial
Key insight: TTS quality crossed the “uncanny valley” in 2023. Modern voices are so natural that listeners often can’t distinguish them from real humans. This enables voice agents, audiobook narration, and accessibility tools — but also raises deepfake concerns.
Music Generation
Suno, Udio, and AI-composed music
How Music AI Works
Music generation models typically use one of two approaches:

1. Audio diffusion: Generate raw audio waveforms using diffusion models conditioned on text descriptions. Similar to image diffusion but in the audio domain.
2. Token prediction: Encode music as discrete tokens (using EnCodec/SoundStream), then predict tokens autoregressively like a language model.

Most modern systems (Suno, Udio) combine both approaches with additional structure for lyrics, melody, and arrangement.
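The token-prediction route has one wrinkle: EnCodec emits several parallel codebook streams per frame. Below is a sketch of the MusicGen-style "delay pattern" that interleaves K streams so a single autoregressive model can predict all of them, coarse codebooks before fine ones. Token values and the `PAD` marker are made up.

```python
# MusicGen-style codebook interleaving: shift stream k right by
# k steps, so at any position the model predicts codebook k only
# after codebooks 0..k-1 for that frame have been generated.
PAD = -1  # placeholder where a stream has no token at that position

def delay_pattern(codes):
    """codes: K lists of length T -> K lists of length T + K - 1."""
    K, T = len(codes), len(codes[0])
    out = [[PAD] * (T + K - 1) for _ in range(K)]
    for k in range(K):
        for t in range(T):
            out[k][t + k] = codes[k][t]
    return out

# Two codebooks, three frames:
print(delay_pattern([[1, 2, 3], [4, 5, 6]]))
# → [[1, 2, 3, -1], [-1, 4, 5, 6]]
```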
Major Platforms
Suno: Full song generation (vocals + instruments) from text prompts. Generates 2–4 minute songs with lyrics, melody, and production.
Udio: Similar to Suno, strong on vocal quality and genre diversity
MusicLM (Google): Text-to-music, research model
Stable Audio (Stability): Open-weight, good for sound effects and ambient music
Meta’s MusicGen: Open-source, text-conditioned music generation
Key insight: Music generation is advancing faster than video generation. Suno can produce radio-quality songs in seconds. The copyright implications are enormous — lawsuits from major labels are ongoing, and the legal framework for AI-generated music is still being established.
Sound Effects & Audio Understanding
Generating and classifying non-speech audio
Sound Effect Generation
Text-to-SFX: “Thunder rumbling in the distance, followed by heavy rain on a tin roof” → realistic audio
Video-to-audio: Generate matching sound effects for silent video clips
Foley automation: AI generates footsteps, door creaks, ambient sounds for film

Models: Stable Audio, AudioLDM, Make-An-Audio. These use diffusion in spectrogram space, conditioned on text descriptions via CLAP (the audio equivalent of CLIP).
Audio Understanding
Audio classification: Identify sounds (dog barking, car horn, music genre)
Speaker diarization: “Who spoke when?” in multi-speaker audio
Emotion detection: Identify emotional tone from voice
Audio captioning: Describe what’s happening in an audio clip
CLAP embeddings: Shared text-audio embedding space (like CLIP for images)
Key insight: CLAP (Contrastive Language-Audio Pretraining) does for audio what CLIP does for images — creates a shared embedding space between text and audio. This enables zero-shot audio classification, text-to-audio generation, and audio search using text queries.
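Zero-shot classification with CLAP-style embeddings boils down to cosine similarity in the shared space. A sketch with hand-made 3-dimensional vectors standing in for real CLAP encoder outputs:

```python
import numpy as np

# CLAP-style zero-shot audio classification: embed the clip and each
# candidate label into the shared space, pick the most similar label.
def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

labels = ["dog barking", "car horn", "rain"]
text_embeds = {
    "dog barking": np.array([0.9, 0.1, 0.0]),
    "car horn":    np.array([0.0, 1.0, 0.1]),
    "rain":        np.array([0.1, 0.0, 1.0]),
}
audio_embed = np.array([0.8, 0.2, 0.1])  # pretend clip: a barking dog

scores = {lbl: cosine(audio_embed, text_embeds[lbl]) for lbl in labels}
best = max(scores, key=scores.get)
print(best)  # → "dog barking"
```

No audio-specific classifier is trained: swapping in a new label set is just a matter of embedding new text strings.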
Real-Time Voice Agents
The convergence of STT + LLM + TTS for conversational AI
The Voice Agent Pipeline
// Traditional voice agent (cascaded)
1. Listen   Whisper STT (~200ms)
2. Think    LLM generates response (~500ms)
3. Speak    TTS synthesizes audio (~200ms)
Total: ~900ms latency (feels natural)

// Native multimodal (GPT-4o style)
1. Listen   Audio tokens directly into model
2. Respond  Model outputs audio tokens directly
Total: ~300ms latency (feels instant)

// Native approach: no STT/TTS pipeline
// Model processes audio natively
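Structurally, the cascaded loop is three function calls. A stub sketch where each stage returns canned output plus the ballpark latency from the figures above; no real models are called, and every function name is a stand-in:

```python
# Cascaded voice-agent turn with stubbed stages.
def stt(audio):   # stand-in for e.g. Whisper
    return "what's the weather?", 200    # (transcript, ms)

def llm(text):    # stand-in for any chat model
    return "It's sunny today.", 500      # (reply text, ms)

def tts(text):    # stand-in for e.g. ElevenLabs / OpenAI TTS
    return b"<audio bytes>", 200         # (waveform, ms)

def handle_turn(audio):
    text, t1 = stt(audio)      # 1. Listen
    reply, t2 = llm(text)      # 2. Think
    speech, t3 = tts(reply)    # 3. Speak
    return speech, t1 + t2 + t3

speech, latency_ms = handle_turn(b"<mic input>")
print(latency_ms)  # → 900: stage latencies add up in a cascade
```

Because the stages run in sequence, their latencies sum; the native approach collapses all three into one model pass, which is where the ~300ms figure comes from.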
Native vs Cascaded
Cascaded (STT → LLM → TTS): Modular, each component can be swapped. Higher latency but more controllable. Most current deployments use this.
Native multimodal (GPT-4o): Model processes audio directly without transcription. Lower latency, preserves tone/emotion, can handle non-verbal cues (laughter, hesitation). The future direction.
Key insight: GPT-4o’s native audio mode was a paradigm shift. Instead of transcribing speech to text, processing text, and synthesizing speech, the model processes audio tokens directly. This preserves emotional nuance and enables sub-300ms response times.
The Complete Audio AI Pipeline
From raw waveform to intelligent audio applications
The Pipeline
// Audio AI capabilities map
Understanding
  STT:    Whisper, Deepgram, AssemblyAI
  Class:  Audio classification, tagging
  Embed:  CLAP embeddings for search
Generation
  TTS:    ElevenLabs, OpenAI, Coqui
  Music:  Suno, Udio, MusicGen
  SFX:    Stable Audio, AudioLDM
Transformation
  Clone:  Voice cloning (ElevenLabs)
  Sep:    Source separation (Demucs)
  Enh:    Noise removal, enhancement
Integration
  Agent:  Real-time voice agents
  Dub:    Automated dubbing/translation
  Sync:   Lip sync for video
Production Applications
Call centers: AI agents handle customer calls with natural voice
Podcasting: Automated transcription, translation, voice cloning for multi-language
Accessibility: Real-time captioning, audio descriptions, screen readers
Gaming: Dynamic NPC dialogue generated in real-time
Healthcare: Medical dictation with domain-specific accuracy
Education: Personalized tutoring with natural voice interaction
Key insight: Audio AI is the most “production-ready” modality after text. Whisper STT and modern TTS are already deployed at massive scale in call centers, accessibility tools, and content creation. The technology is mature enough for real-world, customer-facing applications.
Key Takeaways
What to remember about speech and audio AI
Essential Concepts
1. Audio tokenization (EnCodec): Compresses audio 300x into discrete tokens — enabling Transformer processing

2. Whisper: Universal STT trained on 680K hours, 99 languages, near-human accuracy

3. Modern TTS: Indistinguishable from human voices; voice cloning from 30 seconds of audio

4. Music generation: Full songs from text prompts (Suno, Udio) — legal implications still unresolved

5. Native audio models: GPT-4o processes audio directly without STT/TTS pipeline — sub-300ms latency
The Audio Modality Advantage
Audio AI is the most mature non-text modality:

STT: Production-ready, deployed at massive scale
TTS: Crossed the uncanny valley, commercially available
Voice agents: Already handling millions of customer calls
Music: Generating radio-quality songs in seconds

Audio is where vision was 2 years ago — the technology works, and the ecosystem is rapidly maturing.
Next up: Chapter 9 shifts from generation to understanding — how Vision-Language Models (VLMs) work, the architecture behind GPT-4V, Gemini, and LLaVA, and how models learn to reason about images.