The Challenge
Video generation must produce temporally coherent frames: objects must move smoothly, physics must stay plausible, and style must remain consistent. This makes it much harder than image generation; a 10-second, 30 fps video at 1080p has 300 frames, each of which must stay consistent with all the others.
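To make the scale concrete, here is a rough back-of-the-envelope sketch in Python. The 16x16-pixel, 2-frame spacetime patch size is an illustrative assumption (actual patch sizes vary by model and are mostly unpublished), but it shows why token counts explode for video:

```python
# Rough scale of video vs. image generation, measured in spacetime-patch tokens.
# The 16x16x2 patch size is an illustrative assumption, not a published spec.

def spacetime_patches(width, height, frames, pw=16, ph=16, pt=2):
    """Count non-overlapping spacetime patches in a clip."""
    return (width // pw) * (height // ph) * (frames // pt)

image_tokens = spacetime_patches(1920, 1080, frames=1, pt=1)  # single 1080p frame
video_tokens = spacetime_patches(1920, 1080, frames=300)      # 10 s at 30 fps

print(f"image: {image_tokens:,} tokens")          # 8,040
print(f"video: {video_tokens:,} tokens")          # 1,206,000
print(f"ratio: {video_tokens // image_tokens}x")  # 150x more tokens
```

Since self-attention scales quadratically in token count, 150x more tokens implies far more than 150x the attention compute, which is one reason many video models factorize attention into separate spatial and temporal passes.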
Sora (OpenAI, 2024):
Diffusion transformer on spacetime patches
Up to 60 seconds, 1080p
Understands 3D space and physics
Sora 2 (2025):
Synchronized audio generation
Superior physics realism
1080p, 16–20 seconds via API
Other models:
Runway Gen-3, Pika, Kling, Veo (Google)
Open source: CogVideo, Open-Sora
How Video Models Work
Most video models extend image diffusion to 3D: the denoiser operates on spacetime patches (spatial + temporal), and temporal attention layers keep frames consistent with one another (a sketch follows below). Some models generate keyframes first, then interpolate the frames between them. The compute cost is enormous: video generation requires roughly 100–1000x more compute than image generation.
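As a concrete sketch of that mechanism, the PyTorch block below factorizes attention over a (batch, frames, patches, dim) tensor of patch embeddings: a spatial pass mixes patches within each frame, then a temporal pass mixes the same patch position across frames, which is what ties frames together. This is a minimal illustration of the factorized spacetime pattern, not any specific model's architecture:

```python
import torch
import torch.nn as nn

class FactorizedSpacetimeBlock(nn.Module):
    """Spatial-then-temporal attention over spacetime patch tokens.

    Input shape: (batch, frames, patches_per_frame, dim). Illustrative only.
    """

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, s, d = x.shape

        # Spatial pass: patches attend within their own frame.
        xs = self.norm1(x).reshape(b * t, s, d)
        out, _ = self.spatial_attn(xs, xs, xs)
        x = x + out.reshape(b, t, s, d)

        # Temporal pass: each patch position attends across frames,
        # enforcing frame-to-frame consistency.
        xt = self.norm2(x).transpose(1, 2).reshape(b * s, t, d)
        out, _ = self.temporal_attn(xt, xt, xt)
        x = x + out.reshape(b, s, t, d).transpose(1, 2)
        return x

# Toy usage: 2 clips, 16 frames, an 8x8 patch grid, 256-dim embeddings.
tokens = torch.randn(2, 16, 64, 256)
print(FactorizedSpacetimeBlock(dim=256)(tokens).shape)  # torch.Size([2, 16, 64, 256])
```

In a real diffusion transformer this block would sit inside a denoiser that also conditions on the diffusion timestep and a text embedding; the factorization matters because it replaces joint attention over all frames-times-patches tokens at once with two much cheaper passes.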
The frontier: Video generation today is roughly where image generation was in 2021, impressive but not yet reliable. Open challenges include consistent character identity, accurate physics, long-duration coherence, and real-time generation. The pace of progress suggests many of these may be solved within one to two years.