Ch 7 — Text-to-Video & Motion

Sora’s DiT, spacetime patches, temporal coherence, and current capabilities vs hard limits
High Level
Prompt → 3D Patch → Generate → Coherence → Limits → Future
From Images to Video
Why video generation is orders of magnitude harder
The Scale Challenge
// Data scale: image vs video

1 image (1024×1024):
  3.1M pixel values → ~765 tokens

1 second of video (1080p, 24fps):
  24 frames × 2M pixels = 149M values → ~18,000 tokens

10 seconds of video:
  1.49 BILLION values → ~180,000 tokens

// Video = images + TIME dimension
// 100-1000x more data than a single image
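The arithmetic above is easy to verify. A quick sketch counting raw RGB values (token counts depend on the tokenizer and patch size, so only the value counts are checked here):

```python
# Back-of-envelope data-scale check: raw RGB value counts (3 channels/pixel).
def rgb_values(width, height, frames=1):
    """Number of raw channel values in a clip."""
    return width * height * 3 * frames

image = rgb_values(1024, 1024)             # ~3.1M values
one_second = rgb_values(1920, 1080, 24)    # ~149M values (1080p, 24fps)
ten_seconds = rgb_values(1920, 1080, 240)  # ~1.49B values

print(f"1 image: {image:,} values")
print(f"10s of video vs 1 image: {ten_seconds / image:.0f}x more data")
```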
The Temporal Challenge
Generating beautiful individual frames is “solved” by image diffusion. The hard part is temporal coherence:

Object persistence: A car must look the same across all frames
Motion consistency: Movement must follow physics (gravity, momentum)
Camera coherence: Camera movements must be smooth and realistic
Lighting continuity: Shadows and reflections must be consistent
Cause and effect: Actions must have logical consequences
Key insight: Video generation isn’t just “many images in sequence.” It requires understanding time, physics, and causality. A model that generates each frame independently produces unwatchable flickering. The breakthrough was learning to generate all frames together.
Spacetime Patches & DiT
Sora’s Diffusion Transformer architecture
Spacetime Patches
Just as ViT splits images into 2D patches, video models split video into 3D spacetime patches. Each patch covers a small spatial region across several frames:

Spatial: 16×16 pixels (height × width)
Temporal: 1–4 frames per patch
Result: A 10-second 1080p video becomes ~100,000 spacetime tokens

These 3D patches capture both appearance and motion in a single token.
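The patchify step can be sketched as a pair of reshapes. The patch sizes (2 frames × 16 × 16) match the bullets above but are illustrative; Sora's exact dimensions are not public:

```python
import numpy as np

# Illustrative 3D patchify: split a (T, H, W, C) video into spacetime tokens.
# Patch sizes (pt, ph, pw) = (2, 16, 16) are hypothetical.
def patchify_3d(video, pt=2, ph=16, pw=16):
    """video: (T, H, W, C) -> (num_tokens, token_dim)."""
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    v = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    v = v.transpose(0, 2, 4, 1, 3, 5, 6)    # group the three patch axes together
    return v.reshape(-1, pt * ph * pw * C)

clip = np.zeros((8, 64, 64, 3), dtype=np.float32)  # tiny toy clip
tokens = patchify_3d(clip)
print(tokens.shape)  # (64, 1536): 4*4*4 tokens, each 2*16*16*3 values
```

Each row is one token covering a small spatial region across two frames, so motion within the patch is encoded directly in the token.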
Sora’s DiT Architecture
Sora uses a Diffusion Transformer (DiT) instead of a U-Net:

Input: Spacetime patches of noisy latent video
Architecture: Pure Transformer with self-attention across all spacetime tokens
Conditioning: Text embeddings via cross-attention (like image diffusion)
Output: Denoised spacetime patches

The key advantage: self-attention across all spacetime tokens means every frame “sees” every other frame, enabling temporal coherence.
Key insight: Sora’s breakthrough was treating video as a single sequence of spacetime tokens and using a Transformer to attend across all of them. This is computationally expensive but produces far better temporal coherence than frame-by-frame approaches.
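The two attention patterns can be sketched in plain NumPy (a single head, no learned projections or residuals, just enough to show which tokens attend to which):

```python
import numpy as np

# Single-head scaled dot-product attention over token matrices.
def attention(q, k, v):
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)      # softmax over keys
    return w @ v

rng = np.random.default_rng(0)
tokens = rng.standard_normal((64, 32))   # 64 spacetime tokens, dim 32
text = rng.standard_normal((10, 32))     # 10 text-embedding tokens

# Self-attention: every spacetime token sees every other frame's tokens.
x = attention(tokens, tokens, tokens)
# Cross-attention: spacetime tokens attend to the text prompt.
x = attention(x, text, text)
print(x.shape)  # (64, 32)
```

Note the cost: self-attention is quadratic in the number of spacetime tokens, which is why ~100,000 tokens per clip is expensive.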
Temporal Coherence
How models maintain consistency across frames
Approaches to Coherence
Joint generation (Sora): Generate all frames simultaneously with full spacetime attention. Best coherence but most expensive.
Temporal attention layers: Add temporal self-attention between frames in the U-Net. Used by Runway Gen-3, AnimateDiff.
Autoregressive frames: Generate keyframes first, then interpolate. Used by some video models for longer clips.
Motion modules: Separate modules that learn motion patterns, plugged into image diffusion models.
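To see why frame-by-frame generation flickers, here is a crude, illustrative consistency metric (not a standard benchmark): the mean absolute change between consecutive frames. Coherent video changes gradually; independently generated frames jump.

```python
import numpy as np

# Crude temporal-flicker score: mean absolute frame-to-frame difference.
def flicker_score(frames):
    diffs = [np.abs(frames[i + 1] - frames[i]).mean()
             for i in range(len(frames) - 1)]
    return float(np.mean(diffs))

rng = np.random.default_rng(0)
coherent = [np.full((8, 8), 0.01 * i) for i in range(5)]   # smooth drift
independent = [rng.random((8, 8)) for _ in range(5)]       # no shared state

print(flicker_score(coherent) < flicker_score(independent))  # True
```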
What Models Learn About Physics
Sora’s technical report revealed that large video models learn emergent physics simulation:

3D consistency: Objects maintain shape as camera moves around them
Reflections: Mirrors and water surfaces show correct reflections
Shadows: Cast shadows move consistently with light sources
Fluid dynamics: Water, smoke, and fire behave realistically

These physics aren’t programmed — they emerge from training on millions of videos.
Key insight: Video models are learning to be “world simulators” — they don’t just generate pixels, they learn implicit models of physics, geometry, and causality. This is why video generation is considered a path toward general world understanding.
The Video Model Landscape
Sora, Runway, Veo, Kling, and the open-source frontier
Major Models
// Text-to-video models (2025)

Sora (OpenAI)
  DiT architecture, up to 60 sec 1080p
  Best physics understanding
  Closed, limited access

Veo 2 (Google)
  Up to 120 sec, 4K resolution
  Strong prompt adherence
  Available via Vertex AI

Runway Gen-3 Alpha
  10 sec clips, fast generation
  Strong motion control
  Commercial API available

Kling (Kuaishou)
  Up to 120 sec, 1080p
  Strong character consistency
  Available via API

Open-source: CogVideoX, Mochi
  Shorter clips, lower quality
  Rapidly improving
Capabilities Comparison
Best quality: Sora, Veo 2 (photorealistic, coherent motion)
Best length: Veo 2, Kling (up to 2 minutes)
Best control: Runway Gen-3 (motion brushes, camera control)
Best accessibility: Runway, Kling (commercial APIs)
Best open-source: CogVideoX (improving rapidly)
Key insight: Video generation in 2025 is where image generation was in 2022 — impressive demos but not yet production-ready for most use cases. Quality is improving rapidly, and the gap between the leaders (Sora, Veo) and open-source is closing.
Current Hard Limits
What video generation still can’t do reliably
Known Failures
Physics violations: Objects passing through each other, impossible gravity, liquids behaving wrong
Counting & consistency: Number of fingers, legs, or objects changes between frames
Long-term coherence: Beyond 30 seconds, characters and scenes drift
Text in video: Text on signs, screens, or documents is garbled
Complex interactions: Multiple characters interacting physically (handshakes, fights)
Cause and effect: Blowing out candles doesn’t reliably extinguish them
Compute & Cost
// Video generation costs (approximate)
Sora    ~$0.15-0.50 per 10-sec clip
Runway  ~$0.05-0.25 per 10-sec clip
Kling   ~$0.02-0.10 per 10-sec clip

// Generation time
Sora    2-10 minutes per clip
Runway  30-90 seconds per clip
Kling   1-5 minutes per clip

// Compare: text-to-image is ~$0.01
// and takes 1-5 seconds
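Per-clip cost scales roughly linearly with duration, so budgeting longer outputs is simple arithmetic. A small helper using the approximate ranges from the table above (not official pricing):

```python
# Approximate per-10-second cost ranges (low, high) in USD, from the
# figures above; treat these as rough estimates, not published pricing.
COSTS_PER_10S = {"Sora": (0.15, 0.50), "Runway": (0.05, 0.25), "Kling": (0.02, 0.10)}

def clip_cost(model, seconds):
    """Estimated (low, high) USD cost for a clip of the given length."""
    lo, hi = COSTS_PER_10S[model]
    scale = seconds / 10
    return lo * scale, hi * scale

lo, hi = clip_cost("Sora", 60)
print(f"60s via Sora: ${lo:.2f}-${hi:.2f}")  # 60s via Sora: $0.90-$3.00
```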
Key insight: Video generation is 10–50x more expensive and 10–100x slower than image generation. This isn’t just an engineering problem — video has fundamentally more data (time dimension). Costs will drop, but the gap with images will persist.
Image-to-Video & Video Editing
Beyond text-to-video: animating images and editing clips
Image-to-Video
Provide a reference image and the model animates it into a video clip. This is often more controllable than pure text-to-video because you start with a known visual:

Runway Gen-3: Upload image + describe motion
Stable Video Diffusion: Open-source image animation
Kling: Strong character animation from single images

Use case: animate product shots, bring illustrations to life, create dynamic social media content.
Video Editing with AI
Video inpainting: Remove or replace objects across frames
Style transfer: Apply artistic styles to existing footage
Motion transfer: Apply motion from one video to another
Video upscaling: Enhance resolution of existing footage
Frame interpolation: Generate intermediate frames for slow-motion
Lip sync: Match mouth movements to new audio
Key insight: Image-to-video is currently more practical than pure text-to-video for production use. Starting from a carefully crafted image gives you control over composition, style, and content — the model just adds motion.
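Of the editing operations above, frame interpolation is the simplest to illustrate: the naive baseline is a pixel-wise blend between two neighboring frames (learned interpolators warp along estimated motion instead, but the idea starts here):

```python
import numpy as np

# Naive linear frame interpolation: pixel-wise blend between two frames.
# Flow-based interpolators model motion; this is only the baseline idea.
def interpolate(frame_a, frame_b, n_mid):
    """Return n_mid evenly spaced in-between frames."""
    return [(1 - t) * frame_a + t * frame_b
            for t in np.linspace(0, 1, n_mid + 2)[1:-1]]

a = np.zeros((4, 4))          # dark frame
b = np.ones((4, 4))           # bright frame
mids = interpolate(a, b, 3)   # 3 in-between frames
print(len(mids), mids[1][0, 0])  # 3 0.5
```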
Production Use Cases
Where video generation is already creating value
Working Today
Social media content: Short clips for TikTok, Instagram Reels (5–15 sec)
Advertising: Concept videos and storyboard animation for pitches
E-commerce: Product videos from still photos
Music videos: Abstract and artistic visual accompaniment
Education: Animated explainers and visualizations
Film pre-visualization: Rapid storyboard-to-video for directors
Not Yet Ready For
Feature films: Coherence breaks down beyond 30 seconds; character consistency is unreliable
Live sports/news: Real-time generation not fast enough; accuracy critical
Medical/legal: Hallucinated details are unacceptable
Long-form content: Multi-minute coherent narratives still out of reach
Interactive/gaming: Real-time generation at 30+ FPS not yet feasible
Key insight: Video generation excels at short, creative, non-critical content. The sweet spot is 5–15 second clips where slight imperfections are acceptable. For longer or accuracy-critical content, traditional production is still necessary.
The Future of Video Generation
Where this technology is heading
Near-Term (2025–2026)
Longer coherent clips: 2–5 minute videos with consistent characters
Better control: Camera paths, character actions, scene transitions
Real-time preview: See rough video as you type the prompt
Audio integration: Synchronized sound effects and music
Open-source parity: Open models matching closed-source quality
Long-Term Vision
Interactive video: Generate video in real-time based on user input (gaming, simulation)
World models: Video models that truly understand physics and can predict outcomes
Personalized content: AI-generated shows tailored to individual viewers
Film production: Full scenes with dialogue, consistent characters, and narrative arcs
Embodied AI: Video generation as a planning tool for robots
Next up: Chapter 8 covers speech and audio AI — Whisper for recognition, ElevenLabs for synthesis, music generation, audio tokenization, and real-time voice agents.