Ch 8 — Speech & Audio AI

Whisper, ElevenLabs, music generation, audio tokenization, and real-time voice agents
High Level
Input → Encode → STT → TTS → Music → Agents
Audio Fundamentals
How sound becomes data that AI can process
From Waveform to Spectrogram
Raw audio is a 1D waveform — amplitude values sampled 16,000–44,100 times per second. One second of CD-quality audio = 44,100 values. Far too many for direct processing.

The solution: convert to a mel spectrogram — a 2D image-like representation showing frequency (y-axis) vs time (x-axis) with intensity as color. This lets us apply the same techniques used for images (CNNs, ViTs, patches) to audio.
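The waveform-to-spectrogram step can be sketched with NumPy alone. The snippet below computes a plain STFT magnitude spectrogram; a real front end (e.g. librosa's, or Whisper's) would additionally pool the frequency bins through a mel filterbank, which is omitted here. Parameter values are illustrative.

```python
import numpy as np

def spectrogram(wave, n_fft=400, hop=160):
    """Short-time Fourier transform magnitudes: (frames, freq bins).

    For a 16 kHz signal, n_fft=400 / hop=160 means 25 ms windows
    every 10 ms -- the framing Whisper-style front ends use.
    """
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(wave) - n_fft + 1, hop):
        frame = wave[start:start + n_fft] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames)

# One second of a 440 Hz tone sampled at 16 kHz:
sr = 16_000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # → (98, 201): 98 time frames × 201 frequency bins
```

The energy concentrates in the bin nearest 440 Hz (bin 11 at 40 Hz resolution) — exactly the kind of 2D time-frequency structure CNNs and ViTs can consume.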
Audio Tokenization
// How audio becomes tokens
Raw waveform           44,100 samples/sec × 30 sec = 1.3M values
Mel spectrogram        80 mel bins × 3,000 time frames = 240K values
EnCodec tokens (Meta)  ~75 tokens/sec × 30 sec = 2,250 tokens
                       Discrete tokens, like text!
// EnCodec compresses audio 300x
// while preserving speech quality
Key insight: Audio tokenization (EnCodec, SoundStream) is the audio equivalent of image patching. It converts continuous audio into discrete tokens that can be processed by Transformers — enabling the same “everything is tokens” unification we saw for images.
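The compression figure is easy to sanity-check. A back-of-envelope sketch, assuming EnCodec's 24 kHz mono setting and counting one token frame per step (higher bitrates stack several codebook tokens per frame):

```python
# EnCodec-style compression, back of the envelope.
sample_rate = 24_000  # samples/sec of the 24 kHz mono EnCodec model
token_rate = 75       # token frames emitted per second
seconds = 30

samples = sample_rate * seconds  # 720,000 raw amplitude values
tokens = token_rate * seconds    # 2,250 discrete tokens
print(samples // tokens)         # → 320: roughly the "300x" above
```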
Whisper: Universal Speech Recognition
OpenAI’s multilingual STT that changed the game
How Whisper Works
Whisper is an encoder-decoder Transformer trained on 680,000 hours of multilingual audio:

1. Input: 30-second audio chunks converted to mel spectrograms
2. Encoder: Processes the spectrogram with Transformer layers
3. Decoder: Autoregressively generates text tokens
4. Output: Transcribed text with timestamps

Supports 99 languages, translation, and timestamp-level alignment.
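Step 3 above — autoregressive decoding — reduces to a simple greedy loop. A minimal sketch where `next_token` and the `SCRIPT` table are stand-ins for a real Transformer forward pass; only the two special tokens are actual Whisper vocabulary:

```python
# Toy greedy decoding loop in the shape of Whisper's decoder:
# given encoded audio, emit tokens one at a time until end-of-text.
SOT, EOT = "<|startoftranscript|>", "<|endoftext|>"

# Fake "model": maps the tokens generated so far to the next token.
SCRIPT = {
    (SOT,): "hello",
    (SOT, "hello"): "world",
    (SOT, "hello", "world"): EOT,
}

def next_token(audio_features, tokens):
    # A real decoder attends to audio_features via cross-attention
    # and to its own previous tokens via self-attention.
    return SCRIPT[tuple(tokens)]

def transcribe(audio_features, max_len=32):
    tokens = [SOT]
    for _ in range(max_len):
        tok = next_token(audio_features, tokens)
        if tok == EOT:
            break
        tokens.append(tok)
    return " ".join(tokens[1:])

print(transcribe(audio_features=None))  # → "hello world"
```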
Model Sizes
// Whisper model variants
tiny      39M params   ~32x real-time
base      74M params   ~16x real-time
small     244M params  ~6x real-time
medium    769M params  ~2x real-time
large-v3  1.5B params  ~1x real-time
// "Real-time" = processes audio as fast
// as it plays. large-v3 is near-human
// accuracy across most languages.
Key insight: Whisper’s breakthrough was training on weakly supervised internet audio (680K hours) rather than carefully labeled data. This massive, noisy dataset gave it robustness to accents, background noise, and domain-specific vocabulary that previous models lacked.
Text-to-Speech & Voice Cloning
ElevenLabs, OpenAI TTS, and the voice synthesis revolution
Modern TTS Architecture
Modern TTS systems generate speech that is often indistinguishable from human voices:

1. Text analysis: Parse text, handle abbreviations, numbers, punctuation
2. Prosody prediction: Determine rhythm, stress, intonation
3. Audio generation: Produce waveform using neural vocoders or diffusion

The best systems (ElevenLabs, OpenAI TTS) produce natural-sounding speech with emotion, emphasis, and conversational flow.
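Step 1 — text analysis — is the least glamorous but most bug-prone stage. A toy normalization pass is sketched below; the abbreviation table and number coverage are purely illustrative, not any particular engine's rules.

```python
import re

# Toy TTS text normalization: expand abbreviations and small numbers
# so the synthesizer sees speakable words instead of symbols.
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street", "etc.": "et cetera"}
ONES = ["zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"]
TENS = {10: "ten", 20: "twenty", 30: "thirty", 40: "forty", 50: "fifty"}

def number_to_words(n: int) -> str:
    if n < 10:
        return ONES[n]
    if n in TENS:
        return TENS[n]
    if n < 60:
        return TENS[n // 10 * 10] + "-" + ONES[n % 10]
    return str(n)  # numbers this toy doesn't cover pass through

def normalize(text: str) -> str:
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    return re.sub(r"\d+", lambda m: number_to_words(int(m.group())), text)

print(normalize("Dr. Smith lives at 42 Main St."))
# → "Doctor Smith lives at forty-two Main Street"
```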
Voice Cloning
Voice cloning creates a synthetic voice from a short audio sample:

ElevenLabs: 30 seconds of reference audio for basic cloning
Professional cloning: 3–5 minutes for high-fidelity reproduction
Zero-shot: Some models clone from a single sentence
The TTS Landscape
// Major TTS providers (2025)
ElevenLabs        Best quality, voice cloning, 29 languages
                  $0.30/1K chars, ~200ms latency
OpenAI TTS        6 built-in voices, natural prosody
                  $0.015/1K chars, ~500ms latency
Google Cloud TTS  WaveNet voices, 40+ languages
                  $0.016/1K chars, enterprise features
Open-source       Coqui, Bark, XTTS
                  Free, customizable, self-hosted
                  Quality approaching commercial
Key insight: TTS quality crossed the “uncanny valley” in 2023. Modern voices are so natural that listeners often can’t distinguish them from real humans. This enables voice agents, audiobook narration, and accessibility tools — but also raises deepfake concerns.
Music Generation
Suno, Udio, and AI-composed music
How Music AI Works
Music generation models typically use one of two approaches:

1. Audio diffusion: Generate raw audio waveforms using diffusion models conditioned on text descriptions. Similar to image diffusion but in the audio domain.
2. Token prediction: Encode music as discrete tokens (using EnCodec/SoundStream), then predict tokens autoregressively like a language model.

Most modern systems (Suno, Udio) combine both approaches with additional structure for lyrics, melody, and arrangement.
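The token-prediction route has one wrinkle: EnCodec emits several parallel codebook streams per frame. Below is a sketch of the MusicGen-style "delay pattern" that interleaves K streams so a single autoregressive model can predict all of them, coarse codebooks before fine ones. Token values and the `PAD` marker are made up.

```python
# MusicGen-style codebook interleaving: shift stream k right by
# k steps, so at any position the model predicts codebook k only
# after codebooks 0..k-1 for that frame have been generated.
PAD = -1  # placeholder where a stream has no token at that position

def delay_pattern(codes):
    """codes: K lists of length T -> K lists of length T + K - 1."""
    K, T = len(codes), len(codes[0])
    out = [[PAD] * (T + K - 1) for _ in range(K)]
    for k in range(K):
        for t in range(T):
            out[k][t + k] = codes[k][t]
    return out

# Two codebooks, three frames:
print(delay_pattern([[1, 2, 3], [4, 5, 6]]))
# → [[1, 2, 3, -1], [-1, 4, 5, 6]]
```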
Major Platforms
Suno: Full song generation (vocals + instruments) from text prompts. Generates 2–4 minute songs with lyrics, melody, and production.
Udio: Similar to Suno, strong on vocal quality and genre diversity
MusicLM (Google): Text-to-music, research model
Stable Audio (Stability): Open-weight, good for sound effects and ambient music
Meta’s MusicGen: Open-source, text-conditioned music generation
Key insight: Music generation is advancing faster than video generation. Suno can produce radio-quality songs in seconds. The copyright implications are enormous — lawsuits from major labels are ongoing, and the legal framework for AI-generated music is still being established.
Sound Effects & Audio Understanding
Generating and classifying non-speech audio
Sound Effect Generation
Text-to-SFX: “Thunder rumbling in the distance, followed by heavy rain on a tin roof” → realistic audio
Video-to-audio: Generate matching sound effects for silent video clips
Foley automation: AI generates footsteps, door creaks, ambient sounds for film

Models: Stable Audio, AudioLDM, Make-An-Audio. These use diffusion in spectrogram space, conditioned on text descriptions via CLAP (the audio equivalent of CLIP).
Audio Understanding
Audio classification: Identify sounds (dog barking, car horn, music genre)
Speaker diarization: “Who spoke when?” in multi-speaker audio
Emotion detection: Identify emotional tone from voice
Audio captioning: Describe what’s happening in an audio clip
CLAP embeddings: Shared text-audio embedding space (like CLIP for images)
Key insight: CLAP (Contrastive Language-Audio Pretraining) does for audio what CLIP does for images — creates a shared embedding space between text and audio. This enables zero-shot audio classification, text-to-audio generation, and audio search using text queries.
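Zero-shot classification with CLAP-style embeddings boils down to cosine similarity in the shared space. A sketch with hand-made 3-dimensional vectors standing in for real CLAP encoder outputs:

```python
import numpy as np

# CLAP-style zero-shot audio classification: embed the clip and each
# candidate label into the shared space, pick the most similar label.
def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

labels = ["dog barking", "car horn", "rain"]
text_embeds = {
    "dog barking": np.array([0.9, 0.1, 0.0]),
    "car horn":    np.array([0.0, 1.0, 0.1]),
    "rain":        np.array([0.1, 0.0, 1.0]),
}
audio_embed = np.array([0.8, 0.2, 0.1])  # pretend clip: a barking dog

scores = {lbl: cosine(audio_embed, text_embeds[lbl]) for lbl in labels}
best = max(scores, key=scores.get)
print(best)  # → "dog barking"
```

No audio-specific classifier is trained: swapping in a new label set is just a matter of embedding new text strings.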
Real-Time Voice Agents
The convergence of STT + LLM + TTS for conversational AI
The Voice Agent Pipeline
// Traditional voice agent (cascaded)
1. Listen   Whisper STT (~200ms)
2. Think    LLM generates response (~500ms)
3. Speak    TTS synthesizes audio (~200ms)
Total: ~900ms latency (feels natural)

// Native multimodal (GPT-4o style)
1. Listen   Audio tokens directly into model
2. Respond  Model outputs audio tokens directly
Total: ~300ms latency (feels instant)

// Native approach: no STT/TTS pipeline
// Model processes audio natively
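Structurally, the cascaded loop is three function calls. A stub sketch where each stage returns canned output plus the ballpark latency from the figures above; no real models are called, and every function name is a stand-in:

```python
# Cascaded voice-agent turn with stubbed stages.
def stt(audio):   # stand-in for e.g. Whisper
    return "what's the weather?", 200    # (transcript, ms)

def llm(text):    # stand-in for any chat model
    return "It's sunny today.", 500      # (reply text, ms)

def tts(text):    # stand-in for e.g. ElevenLabs / OpenAI TTS
    return b"<audio bytes>", 200         # (waveform, ms)

def handle_turn(audio):
    text, t1 = stt(audio)      # 1. Listen
    reply, t2 = llm(text)      # 2. Think
    speech, t3 = tts(reply)    # 3. Speak
    return speech, t1 + t2 + t3

speech, latency_ms = handle_turn(b"<mic input>")
print(latency_ms)  # → 900: stage latencies add up in a cascade
```

Because the stages run in sequence, their latencies sum; the native approach collapses all three into one model pass, which is where the ~300ms figure comes from.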
Native vs Cascaded
Cascaded (STT → LLM → TTS): Modular, each component can be swapped. Higher latency but more controllable. Most current deployments use this.
Native multimodal (GPT-4o): Model processes audio directly without transcription. Lower latency, preserves tone/emotion, can handle non-verbal cues (laughter, hesitation). The future direction.
Key insight: GPT-4o’s native audio mode was a paradigm shift. Instead of transcribing speech to text, processing text, and synthesizing speech, the model processes audio tokens directly. This preserves emotional nuance and enables sub-300ms response times.
The Complete Audio AI Pipeline
From raw waveform to intelligent audio applications
The Pipeline
// Audio AI capabilities map
Understanding
  STT:    Whisper, Deepgram, AssemblyAI
  Class:  Audio classification, tagging
  Embed:  CLAP embeddings for search
Generation
  TTS:    ElevenLabs, OpenAI, Coqui
  Music:  Suno, Udio, MusicGen
  SFX:    Stable Audio, AudioLDM
Transformation
  Clone:  Voice cloning (ElevenLabs)
  Sep:    Source separation (Demucs)
  Enh:    Noise removal, enhancement
Integration
  Agent:  Real-time voice agents
  Dub:    Automated dubbing/translation
  Sync:   Lip sync for video
Production Applications
Call centers: AI agents handle customer calls with natural voice
Podcasting: Automated transcription, translation, voice cloning for multi-language
Accessibility: Real-time captioning, audio descriptions, screen readers
Gaming: Dynamic NPC dialogue generated in real-time
Healthcare: Medical dictation with domain-specific accuracy
Education: Personalized tutoring with natural voice interaction
Key insight: Audio AI is the most “production-ready” modality after text. Whisper STT and modern TTS are already deployed at massive scale in call centers, accessibility tools, and content creation. The technology is mature enough for real-world, customer-facing applications.
Key Takeaways
What to remember about speech and audio AI
Essential Concepts
1. Audio tokenization (EnCodec): Compresses audio 300x into discrete tokens — enabling Transformer processing

2. Whisper: Universal STT trained on 680K hours, 99 languages, near-human accuracy

3. Modern TTS: Indistinguishable from human voices; voice cloning from 30 seconds of audio

4. Music generation: Full songs from text prompts (Suno, Udio) — legal implications still unresolved

5. Native audio models: GPT-4o processes audio directly without STT/TTS pipeline — sub-300ms latency
The Audio Modality Advantage
Audio AI is the most mature non-text modality:

STT: Production-ready, deployed at massive scale
TTS: Crossed the uncanny valley, commercially available
Voice agents: Already handling millions of customer calls
Music: Generating radio-quality songs in seconds

Audio is where vision was 2 years ago — the technology works, and the ecosystem is rapidly maturing.
Next up: Chapter 9 shifts from generation to understanding — how Vision-Language Models (VLMs) work, the architecture behind GPT-4V, Gemini, and LLaVA, and how models learn to reason about images.