Image & Video

NVIDIA's Cosmos3-Super-Image2Video runs locally on a single 96GB GPU

A 49-frame 1280x720 image-to-video clip generated in under 3 minutes on a workstation GPU.

Deep Dive

A developer has demonstrated that NVIDIA's Cosmos3-Super-Image2Video, a large video diffusion model, can run entirely locally on a single workstation GPU: the RTX PRO 6000 Blackwell with 96GB VRAM. The model was loaded in its BF16 precision variant through vLLM with layerwise offloading, on Ubuntu 24.04 with NVIDIA driver 580.126.09 and CUDA 13.0. The machine had 128GB of system RAM and needed an additional 128GB swap file to survive the shard loading phase; loading initially failed without the swap safety net. Once loaded, inference used SDPA attention (not SAGE) and produced a 1280x720 image-to-video result: 49 frames at 24 fps (about 2 seconds of video) with 20 steps, completed in 174 seconds and consuming around 73–74GB VRAM. A longer 121-frame test (about 5 seconds of video) ran in about 9 minutes, with VRAM peaking at 84–85GB and system RAM at ~76GB. The developer notes that quality and prompting need further tuning, but the core goal of running the full Cosmos3 Super model on a single card was achieved.
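The reported figures can be turned into clip length and throughput with a little arithmetic. The sketch below is illustrative only: the function name `clip_stats` is hypothetical, and the longer run is taken as exactly 540 seconds even though the source only says "about 9 minutes".

```python
def clip_stats(frames, fps, wall_seconds):
    """Derive clip length and generation cost from reported run figures."""
    clip_seconds = frames / fps
    return {
        "clip_seconds": clip_seconds,                       # length of the generated video
        "seconds_per_frame": wall_seconds / frames,         # wall-clock cost per output frame
        "slowdown_vs_realtime": wall_seconds / clip_seconds,  # compute time per second of video
    }


# 49 frames at 24 fps in 174 s (figures from the report)
short_run = clip_stats(frames=49, fps=24, wall_seconds=174)

# 121 frames at 24 fps in "about 9 minutes" (assumed here to be 540 s)
long_run = clip_stats(frames=121, fps=24, wall_seconds=9 * 60)

for name, stats in (("49-frame", short_run), ("121-frame", long_run)):
    print(
        f"{name}: {stats['clip_seconds']:.2f} s of video, "
        f"{stats['seconds_per_frame']:.2f} s per frame, "
        f"{stats['slowdown_vs_realtime']:.1f}x slower than real time"
    )
```

Under these assumptions the short run costs roughly 3.6 seconds per frame and the longer run roughly 4.5 seconds per frame, so per-frame cost appears to rise with clip length rather than stay constant.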

The test used an anime-style prompt of a demon queen casting a glowing orb, generating a smooth, cinematic video with consistent character details and lighting. The developer's curl command shows the API-driven workflow: a request to vLLM's video generation endpoint that carries an input image, a detailed prompt, and a negative prompt as a guardrail. This milestone is significant for AI creators who want to avoid cloud dependency for high-quality video generation. The RTX PRO 6000's 96GB VRAM is the key enabler, but the need for 128GB of system RAM plus 128GB of swap shows the model's extreme memory appetite during initialization. Once loaded, both test clips stayed within the card's 96GB of VRAM (peaking at 73–85GB), while system RAM use was about 76GB in the longer test. The developer plans to test SAGE attention for speed improvements and to try better prompting techniques to improve output quality.

Key Points
  • Runs on a single RTX PRO 6000 Blackwell 96GB GPU using BF16 and vLLM with layerwise offloading.
  • 1280x720 video at 24 fps: 49 frames in 174 sec (73–74GB VRAM), 121 frames in ~9 min (84–85GB VRAM).
  • Model loading needed 128GB system RAM plus a 128GB swap file; after initialization, both test runs peaked below the card's 96GB of VRAM.
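Because loading failed without swap, a quick preflight check of RAM plus swap before starting the server may save a failed load. The following is a minimal sketch, not the developer's method: the function names are hypothetical, the sample values are made up, and the 200 GiB threshold is an arbitrary illustrative number. The report only establishes that 128GB of RAM alone was not enough and that 128GB of RAM plus 128GB of swap was.

```python
def parse_meminfo(text):
    """Return {field: kibibytes} parsed from /proc/meminfo-style text."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def has_headroom(meminfo_text, required_gib):
    """True if MemTotal + SwapTotal is at least required_gib (GiB)."""
    info = parse_meminfo(meminfo_text)
    total_kib = info.get("MemTotal", 0) + info.get("SwapTotal", 0)
    return total_kib / (1024 ** 2) >= required_gib


# Made-up samples in /proc/meminfo format: ~125 GiB of RAM, with and without swap.
RAM_ONLY = "MemTotal:       131072000 kB\nSwapTotal:              0 kB\n"
WITH_SWAP = "MemTotal:       131072000 kB\nSwapTotal:      134217728 kB\n"

REQUIRED_GIB = 200  # arbitrary illustrative threshold, not a measured requirement

print("RAM only :", has_headroom(RAM_ONLY, REQUIRED_GIB))
print("With swap:", has_headroom(WITH_SWAP, REQUIRED_GIB))
```

On a Linux machine the same check can be run against live values by passing the contents of `/proc/meminfo` to `has_headroom`.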

Why It Matters

Shows that high-quality video generation can run locally on a single 96GB workstation GPU, reducing professionals' reliance on cloud GPUs.