Ch 7 — Text-to-Video & Motion

Sora’s DiT, spacetime patches, temporal coherence, and current capabilities vs hard limits
High Level
Prompt → 3D Patch → Generate → Coherence → Limits → Future
From Images to Video
Why video generation is orders of magnitude harder
The Scale Challenge
// Data scale: image vs video

1 image (1024×1024):
  3.1M pixel values → ~765 tokens

1 second of video (1080p, 24fps):
  24 frames × 2M pixels = 149M values → ~18,000 tokens

10 seconds of video:
  1.49 BILLION values → ~180,000 tokens

// Video = images + TIME dimension
// 100-1000x more data than a single image
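The arithmetic above is easy to verify. A quick sketch counting raw RGB values (token counts depend on the tokenizer and patch size, so only the value counts are checked here):

```python
# Back-of-envelope data-scale check: raw RGB value counts (3 channels/pixel).
def rgb_values(width, height, frames=1):
    """Number of raw channel values in a clip."""
    return width * height * 3 * frames

image = rgb_values(1024, 1024)             # ~3.1M values
one_second = rgb_values(1920, 1080, 24)    # ~149M values (1080p, 24fps)
ten_seconds = rgb_values(1920, 1080, 240)  # ~1.49B values

print(f"1 image: {image:,} values")
print(f"10s of video vs 1 image: {ten_seconds / image:.0f}x more data")
```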
The Temporal Challenge
Generating beautiful individual frames is “solved” by image diffusion. The hard part is temporal coherence:

Object persistence: A car must look the same across all frames
Motion consistency: Movement must follow physics (gravity, momentum)
Camera coherence: Camera movements must be smooth and realistic
Lighting continuity: Shadows and reflections must be consistent
Cause and effect: Actions must have logical consequences
Key insight: Video generation isn’t just “many images in sequence.” It requires understanding time, physics, and causality. A model that generates each frame independently produces unwatchable flickering. The breakthrough was learning to generate all frames together.
Spacetime Patches & DiT
Sora’s Diffusion Transformer architecture
Spacetime Patches
Just as ViT splits images into 2D patches, video models split video into 3D spacetime patches. Each patch covers a small spatial region across several frames:

Spatial: 16×16 pixels (height × width)
Temporal: 1–4 frames per patch
Result: A 10-second 1080p video becomes ~100,000 spacetime tokens

These 3D patches capture both appearance and motion in a single token.
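The patchify step can be sketched as a pair of reshapes. The patch sizes (2 frames × 16 × 16) match the bullets above but are illustrative; Sora's exact dimensions are not public:

```python
import numpy as np

# Illustrative 3D patchify: split a (T, H, W, C) video into spacetime tokens.
# Patch sizes (pt, ph, pw) = (2, 16, 16) are hypothetical.
def patchify_3d(video, pt=2, ph=16, pw=16):
    """video: (T, H, W, C) -> (num_tokens, token_dim)."""
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    v = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    v = v.transpose(0, 2, 4, 1, 3, 5, 6)    # group the three patch axes together
    return v.reshape(-1, pt * ph * pw * C)

clip = np.zeros((8, 64, 64, 3), dtype=np.float32)  # tiny toy clip
tokens = patchify_3d(clip)
print(tokens.shape)  # (64, 1536): 4*4*4 tokens, each 2*16*16*3 values
```

Each row is one token covering a small spatial region across two frames, so motion within the patch is encoded directly in the token.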
Sora’s DiT Architecture
Sora uses a Diffusion Transformer (DiT) instead of a U-Net:

Input: Spacetime patches of noisy latent video
Architecture: Pure Transformer with self-attention across all spacetime tokens
Conditioning: Text embeddings via cross-attention (like image diffusion)
Output: Denoised spacetime patches

The key advantage: self-attention across all spacetime tokens means every frame “sees” every other frame, enabling temporal coherence.
Key insight: Sora’s breakthrough was treating video as a single sequence of spacetime tokens and using a Transformer to attend across all of them. This is computationally expensive but produces far better temporal coherence than frame-by-frame approaches.
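The two attention patterns can be sketched in plain NumPy (a single head, no learned projections or residuals, just enough to show which tokens attend to which):

```python
import numpy as np

# Single-head scaled dot-product attention over token matrices.
def attention(q, k, v):
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)      # softmax over keys
    return w @ v

rng = np.random.default_rng(0)
tokens = rng.standard_normal((64, 32))   # 64 spacetime tokens, dim 32
text = rng.standard_normal((10, 32))     # 10 text-embedding tokens

# Self-attention: every spacetime token sees every other frame's tokens.
x = attention(tokens, tokens, tokens)
# Cross-attention: spacetime tokens attend to the text prompt.
x = attention(x, text, text)
print(x.shape)  # (64, 32)
```

Note the cost: self-attention is quadratic in the number of spacetime tokens, which is why ~100,000 tokens per clip is expensive.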
Temporal Coherence
How models maintain consistency across frames
Approaches to Coherence
Joint generation (Sora): Generate all frames simultaneously with full spacetime attention. Best coherence but most expensive.
Temporal attention layers: Add temporal self-attention between frames in the U-Net. Used by Runway Gen-3, AnimateDiff.
Autoregressive frames: Generate keyframes first, then interpolate. Used by some video models for longer clips.
Motion modules: Separate modules that learn motion patterns, plugged into image diffusion models.
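To see why frame-by-frame generation flickers, here is a crude, illustrative consistency metric (not a standard benchmark): the mean absolute change between consecutive frames. Coherent video changes gradually; independently generated frames jump.

```python
import numpy as np

# Crude temporal-flicker score: mean absolute frame-to-frame difference.
def flicker_score(frames):
    diffs = [np.abs(frames[i + 1] - frames[i]).mean()
             for i in range(len(frames) - 1)]
    return float(np.mean(diffs))

rng = np.random.default_rng(0)
coherent = [np.full((8, 8), 0.01 * i) for i in range(5)]   # smooth drift
independent = [rng.random((8, 8)) for _ in range(5)]       # no shared state

print(flicker_score(coherent) < flicker_score(independent))  # True
```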
What Models Learn About Physics
Sora’s technical report revealed that large video models learn emergent physics simulation:

3D consistency: Objects maintain shape as camera moves around them
Reflections: Mirrors and water surfaces show correct reflections
Shadows: Cast shadows move consistently with light sources
Fluid dynamics: Water, smoke, and fire behave realistically

These physics aren’t programmed — they emerge from training on millions of videos.
Key insight: Video models are learning to be “world simulators” — they don’t just generate pixels, they learn implicit models of physics, geometry, and causality. This is why video generation is considered a path toward general world understanding.
The Video Model Landscape
Sora, Runway, Veo, Kling, and the open-source frontier
Major Models
// Text-to-video models (2025)

Sora (OpenAI)
  DiT architecture, up to 60 sec 1080p
  Best physics understanding
  Closed, limited access

Veo 2 (Google)
  Up to 120 sec, 4K resolution
  Strong prompt adherence
  Available via Vertex AI

Runway Gen-3 Alpha
  10 sec clips, fast generation
  Strong motion control
  Commercial API available

Kling (Kuaishou)
  Up to 120 sec, 1080p
  Strong character consistency
  Available via API

Open-source: CogVideoX, Mochi
  Shorter clips, lower quality
  Rapidly improving
Capabilities Comparison
Best quality: Sora, Veo 2 (photorealistic, coherent motion)
Best length: Veo 2, Kling (up to 2 minutes)
Best control: Runway Gen-3 (motion brushes, camera control)
Best accessibility: Runway, Kling (commercial APIs)
Best open-source: CogVideoX (improving rapidly)
Key insight: Video generation in 2025 is where image generation was in 2022 — impressive demos but not yet production-ready for most use cases. Quality is improving rapidly, and the gap between the leaders (Sora, Veo) and open-source is closing.
Current Hard Limits
What video generation still can’t do reliably
Known Failures
Physics violations: Objects passing through each other, impossible gravity, liquids behaving wrong
Counting & consistency: Number of fingers, legs, or objects changes between frames
Long-term coherence: Beyond 30 seconds, characters and scenes drift
Text in video: Text on signs, screens, or documents is garbled
Complex interactions: Multiple characters interacting physically (handshakes, fights)
Cause and effect: Blowing out candles doesn’t reliably extinguish them
Compute & Cost
// Video generation costs (approximate)
Sora    ~$0.15-0.50 per 10-sec clip
Runway  ~$0.05-0.25 per 10-sec clip
Kling   ~$0.02-0.10 per 10-sec clip

// Generation time
Sora    2-10 minutes per clip
Runway  30-90 seconds per clip
Kling   1-5 minutes per clip

// Compare: text-to-image is ~$0.01
// and takes 1-5 seconds
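Per-clip cost scales roughly linearly with duration, so budgeting longer outputs is simple arithmetic. A small helper using the approximate ranges from the table above (not official pricing):

```python
# Approximate per-10-second cost ranges (low, high) in USD, from the
# figures above; treat these as rough estimates, not published pricing.
COSTS_PER_10S = {"Sora": (0.15, 0.50), "Runway": (0.05, 0.25), "Kling": (0.02, 0.10)}

def clip_cost(model, seconds):
    """Estimated (low, high) USD cost for a clip of the given length."""
    lo, hi = COSTS_PER_10S[model]
    scale = seconds / 10
    return lo * scale, hi * scale

lo, hi = clip_cost("Sora", 60)
print(f"60s via Sora: ${lo:.2f}-${hi:.2f}")  # 60s via Sora: $0.90-$3.00
```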
Key insight: Video generation is 10–50x more expensive and 10–100x slower than image generation. This isn’t just an engineering problem — video has fundamentally more data (time dimension). Costs will drop, but the gap with images will persist.
Image-to-Video & Video Editing
Beyond text-to-video: animating images and editing clips
Image-to-Video
Provide a reference image and the model animates it into a video clip. This is often more controllable than pure text-to-video because you start with a known visual:

Runway Gen-3: Upload image + describe motion
Stable Video Diffusion: Open-source image animation
Kling: Strong character animation from single images

Use case: animate product shots, bring illustrations to life, create dynamic social media content.
Video Editing with AI
Video inpainting: Remove or replace objects across frames
Style transfer: Apply artistic styles to existing footage
Motion transfer: Apply motion from one video to another
Video upscaling: Enhance resolution of existing footage
Frame interpolation: Generate intermediate frames for slow-motion
Lip sync: Match mouth movements to new audio
Key insight: Image-to-video is currently more practical than pure text-to-video for production use. Starting from a carefully crafted image gives you control over composition, style, and content — the model just adds motion.
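Of the editing operations above, frame interpolation is the simplest to illustrate: the naive baseline is a pixel-wise blend between two neighboring frames (learned interpolators warp along estimated motion instead, but the idea starts here):

```python
import numpy as np

# Naive linear frame interpolation: pixel-wise blend between two frames.
# Flow-based interpolators model motion; this is only the baseline idea.
def interpolate(frame_a, frame_b, n_mid):
    """Return n_mid evenly spaced in-between frames."""
    return [(1 - t) * frame_a + t * frame_b
            for t in np.linspace(0, 1, n_mid + 2)[1:-1]]

a = np.zeros((4, 4))          # dark frame
b = np.ones((4, 4))           # bright frame
mids = interpolate(a, b, 3)   # 3 in-between frames
print(len(mids), mids[1][0, 0])  # 3 0.5
```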
Production Use Cases
Where video generation is already creating value
Working Today
Social media content: Short clips for TikTok, Instagram Reels (5–15 sec)
Advertising: Concept videos and storyboard animation for pitches
E-commerce: Product videos from still photos
Music videos: Abstract and artistic visual accompaniment
Education: Animated explainers and visualizations
Film pre-visualization: Rapid storyboard-to-video for directors
Not Yet Ready For
Feature films: Coherence breaks down beyond 30 seconds; character consistency is unreliable
Live sports/news: Real-time generation not fast enough; accuracy critical
Medical/legal: Hallucinated details are unacceptable
Long-form content: Multi-minute coherent narratives still out of reach
Interactive/gaming: Real-time generation at 30+ FPS not yet feasible
Key insight: Video generation excels at short, creative, non-critical content. The sweet spot is 5–15 second clips where slight imperfections are acceptable. For longer or accuracy-critical content, traditional production is still necessary.
The Future of Video Generation
Where this technology is heading
Near-Term (2025–2026)
Longer coherent clips: 2–5 minute videos with consistent characters
Better control: Camera paths, character actions, scene transitions
Real-time preview: See rough video as you type the prompt
Audio integration: Synchronized sound effects and music
Open-source parity: Open models matching closed-source quality
Long-Term Vision
Interactive video: Generate video in real-time based on user input (gaming, simulation)
World models: Video models that truly understand physics and can predict outcomes
Personalized content: AI-generated shows tailored to individual viewers
Film production: Full scenes with dialogue, consistent characters, and narrative arcs
Embodied AI: Video generation as a planning tool for robots
Next up: Chapter 8 covers speech and audio AI — Whisper for recognition, ElevenLabs for synthesis, music generation, audio tokenization, and real-time voice agents.