Image & Video

NVIDIA Cosmos3 Nano generates 720x720 video in 9 minutes on dual RTX 3090s

Dual RTX 3090 setup cranks out 161 frames in 9 minutes – but 1280x720 triggers OOM

Deep Dive

A Reddit user benchmarked NVIDIA's Cosmos3 Nano model using vllm-omni on a system with dual RTX 3090s (48GB VRAM total), a Ryzen 9 9950X, and 128GB DDR5 RAM. With tensor-parallel-size set to 2 and the SAGE_ATTN diffusion attention backend, the setup generated a 720x720, 161-frame video in 9 minutes using 20 inference steps and a guidance scale of 6.0. The prompt requested a stop-motion video of a yarn turtle, and the output was saved locally. Attempting 1280x720 resolution, however, caused an out-of-memory (OOM) error during video decoding, even though the generation phase itself completed without issues.
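To put those figures in perspective, here is a rough back-of-the-envelope sketch in Python. It uses only the numbers reported above; treating "9 minutes" as 540 seconds of wall-clock time is an assumption, and the pixel ratio only hints at how much larger the decoded frames are at 1280x720. It is not a measurement of actual VRAM use.

```python
# Figures taken from the benchmark described above.
FRAMES = 161
WALL_CLOCK_S = 9 * 60  # assumption: "9 minutes" is 540 s of total wall-clock time


def pixels(width: int, height: int) -> int:
    """Number of pixels in a single frame."""
    return width * height


# Average wall-clock cost per output frame for the 720x720 run.
seconds_per_frame = WALL_CLOCK_S / FRAMES

# How many more pixels each frame has at the resolution that hit OOM.
pixel_ratio = pixels(1280, 720) / pixels(720, 720)

print(f"{seconds_per_frame:.2f} s per frame")                     # 3.35 s per frame
print(f"1280x720 has {pixel_ratio:.2f}x the pixels of 720x720")  # 1.78x
```

If decode-time memory grows roughly in proportion to pixel count (an assumption, not something the tester measured), the 1280x720 run would need on the order of 1.8 times the decode memory of the 720x720 run, which is consistent with the failure appearing at the decoding stage rather than during generation.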

Quality-wise, the Nano variant underperforms for text-to-video (T2V) tasks, exhibiting noticeable smearing and object distortion. Image-to-video (I2V) results are ‘totally acceptable’ by the tester's account, though artifacts remain. The user highlighted the model's potential for local video generation but noted that professional-grade results may require the larger Cosmos3 model. The entire stack ran on vllm-omni's omnimodel support, with no guardrails config. This DIY setup demonstrates that local video generation is feasible on consumer hardware, but the trade-offs in resolution and quality are still significant.

Key Points
  • Generated a 720x720, 161-frame video in 9 minutes on dual RTX 3090s using vllm-omni with tensor-parallel-size 2
  • 1280x720 resolution triggers OOM error during video decoding, not during generation
  • Image-to-video quality is acceptable; text-to-video shows smearing and artifacts due to Nano model limitations

Why It Matters

Local video generation is now possible with dual consumer GPUs, but quality and resolution constraints remain for professional use.