Near-Term (2025–2026)
• Longer coherent clips: 2–5 minute videos with consistent characters
• Better control: Camera paths, character actions, scene transitions
• Real-time preview: See a rough draft of the video as you type the prompt
• Audio integration: Synchronized sound effects and music
• Open-source parity: Open models matching closed-source quality
Long-Term Vision
• Interactive video: Generate video in real time in response to user input (gaming, simulation)
• World models: Video models that internalize physical dynamics well enough to predict outcomes
• Personalized content: AI-generated shows tailored to individual viewers
• Film production: Full scenes with dialogue, consistent characters, and narrative arcs
• Embodied AI: Video generation as a planning tool for robots
Next up: Chapter 8 covers speech and audio AI — Whisper for recognition, ElevenLabs for synthesis, music generation, audio tokenization, and real-time voice agents.