NVIDIA's Cosmos Predict 2.5 fine-tuned with LoRA/DoRA for robot video generation
Parameter-efficient fine-tuning on a single GPU adapts the model to generate synthetic robot trajectories
NVIDIA's Cosmos Predict 2.5 is a large-scale world model capable of generating physically plausible videos conditioned on text, images, or video clips. To adapt it for specific domains like robot manipulation, the team published a guide on parameter-efficient fine-tuning using LoRA (Low-Rank Adaptation) and DoRA (Weight-Decomposed Low-Rank Adaptation). Both methods inject small trainable adapters into the frozen 2B-parameter base model, so gradients and optimizer state are needed only for the adapter weights. This cuts memory requirements enough to allow training on a single 80GB GPU. Keeping the base weights frozen also helps prevent catastrophic forgetting of general knowledge, and the resulting adapter files are small and portable, so multiple domain adapters can be swapped at inference.
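To make the memory argument concrete, here is a minimal sketch of the parameter arithmetic behind LoRA. A LoRA adapter on a dense projection of shape d_out x d_in trains two low-rank matrices, A (rank x d_in) and B (d_out x rank), in place of the full weight. The layer sizes and rank below are hypothetical and chosen only for illustration; they are not taken from Cosmos Predict 2.5 or from the guide. DoRA additionally learns a magnitude vector per adapted layer, which this sketch leaves out.

```python
# Illustrative arithmetic only: the layer sizes and rank are hypothetical,
# not values from Cosmos Predict 2.5 or the fine-tuning guide.


def full_param_count(d_in: int, d_out: int) -> int:
    """Weights in a dense d_out x d_in projection (bias ignored)."""
    return d_in * d_out


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable weights LoRA adds: A is rank x d_in, B is d_out x rank."""
    return rank * d_in + d_out * rank


d_in, d_out, rank = 2048, 2048, 16  # hypothetical sizes
full = full_param_count(d_in, d_out)
lora = lora_param_count(d_in, d_out, rank)

print(f"full layer:   {full:,} weights")
print(f"LoRA adapter: {lora:,} trainable weights "
      f"({100 * lora / full:.2f}% of the layer)")
```

With these assumed sizes the adapter trains under 2% of the layer's weights. The real ratio depends on which layers are adapted and on the rank chosen.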
Using the GR00T Dreams post-training recipe dataset (92 robot manipulation videos with text prompts describing pick-and-place tasks), the fine-tuned model can generate synthetic robot trajectories for downstream learning. The training pipeline uses the diffusers, accelerate, and peft libraries and supports both single- and multi-GPU setups. Generating trajectories synthetically can reduce the cost and time of collecting real-robot demonstration data, giving robotics teams a scalable source of training data for policies. The guide provides complete code and data preprocessing steps, so engineers can replicate the workflow and adapt it to their own domains.
- NVIDIA's Cosmos Predict 2.5 is a 2B-parameter world model for physically plausible video generation
- LoRA/DoRA fine-tuning reduces memory requirements, enabling single-GPU training on an 80GB GPU
- Fine-tuned on 92 robot manipulation videos to generate synthetic trajectories for downstream robot learning
Why It Matters
Synthetic robot video generation from a fine-tuned world model slashes data collection costs for training robot policies.