Learning Zero-Shot Subject-Driven Video Generation Using 1% Compute
Only 288 A100 GPU hours versus up to 200K for prior methods – a ~99% compute reduction.
A new approach to subject-driven video generation (SDV-Gen) slashes compute requirements to just 1% of previous zero-shot methods. The framework, proposed by Daneul Kim and colleagues, trains on only 200K subject-image pairs and 4,000 arbitrary videos, using 288 A100 GPU hours, compared with 10K-200K hours for earlier models such as VACE (0.4% of its compute) and Phantom (2.8%). It works with CogVideoX-5B and transfers to Wan 2.2-5B.
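As a quick back-of-envelope check, the only figures taken from the summary above are 288 hours and the two percentages; the implied baseline budgets below are simple arithmetic, not numbers reported here:

```python
# Consistency check: what the quoted compute shares imply for the baselines.
ours = 288  # A100 GPU hours
for name, share in [("VACE", 0.004), ("Phantom", 0.028)]:
    print(f"{name}: ~{ours / share:,.0f} A100 hours implied")
# VACE: ~72,000 A100 hours implied
# Phantom: ~10,286 A100 hours implied
```

Both implied figures fall inside the 10K-200K hour range cited for prior zero-shot baselines.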
Key innovations include decomposing SDV-Gen into identity injection (learned from subject-image pairs) and motion-awareness (maintained using arbitrary videos), with training stochastically switching between the two objectives. Random reference-frame sampling and image-token dropout prevent trivial first-frame copying. Gradient analysis shows the two objectives quickly become nearly orthogonal, enabling stable optimization. The result: competitive subject fidelity and motion quality without per-subject tuning or massive video datasets.
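The description maps to a simple training loop. The sketch below is illustrative only: the model interface, the plain reconstruction loss (standing in for the actual diffusion objective), the switch probability, and the dropout rate are all assumptions, not the authors' code.

```python
import random

import torch
import torch.nn.functional as F

# Hypothetical hyperparameters; the paper's actual values are not given here.
P_IDENTITY = 0.5       # chance of taking an identity-injection step
TOKEN_DROPOUT_P = 0.2  # fraction of reference-image tokens dropped


def drop_image_tokens(ref_tokens, p):
    """Randomly zero reference-image tokens so the model cannot simply copy
    the reference image into the output (the image-token dropout above)."""
    keep = (torch.rand(ref_tokens.shape[:2], device=ref_tokens.device) > p).float()
    return ref_tokens * keep.unsqueeze(-1)


def training_step(model, image_pair_batch, video_batch, optimizer):
    """One optimization step, stochastically switching between objectives.
    A plain reconstruction loss stands in for the real diffusion loss."""
    optimizer.zero_grad()

    if random.random() < P_IDENTITY:
        # Identity injection, learned from subject-image pairs.
        ref_tokens, target_image = image_pair_batch
        ref_tokens = drop_image_tokens(ref_tokens, TOKEN_DROPOUT_P)
        pred = model(ref_tokens=ref_tokens, target=target_image)  # hypothetical interface
        loss = F.mse_loss(pred, target_image)
    else:
        # Motion awareness, kept alive with arbitrary videos: condition on a
        # *random* frame so the model cannot learn to copy the first frame.
        video = video_batch  # (B, T, C, H, W)
        t = random.randrange(video.shape[1])
        ref_tokens = model.encode_image(video[:, t])  # hypothetical helper
        pred = model(ref_tokens=ref_tokens, target=video)
        loss = F.mse_loss(pred, video)

    loss.backward()
    optimizer.step()
    return loss


def gradient_cosine(model, loss_identity, loss_motion):
    """Cosine similarity between the two objectives' gradients; values near
    zero correspond to the near-orthogonality mentioned above."""
    flat = []
    params = [p for p in model.parameters() if p.requires_grad]
    for loss in (loss_identity, loss_motion):
        grads = torch.autograd.grad(loss, params, retain_graph=True, allow_unused=True)
        flat.append(torch.cat([g.flatten() for g in grads if g is not None]))
    return F.cosine_similarity(flat[0], flat[1], dim=0)
```

One consequence of switching between the losses rather than summing them is that no per-step loss weighting is needed; the switch probability plays that role.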
- Uses 288 A100 GPU hours vs 10K-200K hours for prior zero-shot baselines (1% compute).
- Trained on 200K subject-image pairs and 4,000 arbitrary videos—no subject-video pairs needed.
- Decomposes task into identity injection and motion-awareness with stochastic switching for stability.
Why It Matters
Enables custom video generation at a fraction of the cost, democratizing access for startups and researchers.