Image & Video

Nvidia SANA Video 2B

This ultra-efficient diffusion model slashes training costs to 1% of competitors' and runs on consumer RTX 5090 GPUs.

Deep Dive

Nvidia's research team has unveiled SANA-Video 2B, a groundbreaking, ultra-efficient diffusion model designed to generate high-quality, minute-long videos at 720×1280 resolution. The model's core innovation is its Linear DiT architecture, which replaces standard attention with linear attention, drastically improving efficiency when processing the massive number of tokens required for video. This is paired with a novel constant-memory KV cache for block linear attention, enabling the model to maintain global context for long sequences without the traditional memory bottleneck that cripples other models. These technical leaps allow SANA-Video to synthesize long-form content that was previously computationally prohibitive.
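To make the constant-memory idea concrete, here is a minimal sketch of causal linear attention with a recurrent state: instead of a KV cache that grows with sequence length, the model keeps a fixed-size accumulator, so memory stays constant no matter how long the video token stream gets. The function name, feature map, and shapes below are illustrative assumptions, not Nvidia's actual SANA-Video implementation.

```python
import numpy as np

def linear_attention_stream(qs, ks, vs):
    """Causal linear attention with a constant-size recurrent state.

    A standard KV cache stores all past keys/values (O(T) memory).
    Linear attention instead keeps only S = sum_t phi(k_t) v_t^T and
    a normalizer z = sum_t phi(k_t), so per-step memory is O(d^2),
    independent of sequence length T. Hypothetical sketch, not the
    SANA-Video source code.
    """
    phi = lambda x: np.maximum(x, 0) + 1e-6   # simple positive feature map
    d = qs.shape[1]
    S = np.zeros((d, vs.shape[1]))            # constant-memory "KV cache"
    z = np.zeros(d)
    outs = []
    for q, k, v in zip(qs, ks, vs):
        fk = phi(k)
        S += np.outer(fk, v)                  # fold this step's key/value into the state
        z += fk
        fq = phi(q)
        outs.append((fq @ S) / (fq @ z))      # O(d^2) work per token, no growth with T
    return np.array(outs)
```

Because the state is folded in as tokens arrive, the same loop can stream a minute-long token sequence without the memory bottleneck that quadratic attention with a full KV cache would hit.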

This efficiency translates into staggering cost and speed advantages. Training SANA-Video 2B required just 12 days on 64 H100 GPUs, costing only 1% of what it took to train a comparable model like MovieGen. In inference, it outperforms modern small models like Wan 2.1 and SkyReel-V2, being 16 times faster in measured latency. Crucially, it's deployable on consumer hardware like the upcoming RTX 5090 GPU, where it accelerates the generation of a 5-second 720p video from 71 seconds down to just 29 seconds—a 2.4x speedup. This combination of low-cost training and fast, accessible inference sets a new benchmark for democratizing high-quality video generation, moving it from the realm of cloud supercomputers to powerful desktop workstations.

Key Points
  • Uses Linear DiT and constant-memory KV cache for efficient long-context video synthesis, enabling minute-long 720p generation.
  • Training cost is only 1% of MovieGen's, completed in 12 days on 64 H100 GPUs.
  • Runs 16x faster than rivals like Wan 2.1 and achieves a 2.4x speedup on RTX 5090 GPUs, cutting 5-second video generation to 29 seconds.

Why It Matters

Drastically lowers the cost and hardware barrier for high-quality AI video generation, enabling faster prototyping and new creative applications on consumer GPUs.