CoMoGen generates realistic interactive videos from simple mask sequences
New framework creates realistic object interactions from just binary mask inputs and one image
CoMoGen is a video generation framework that takes a single binary mask sequence (indicating where and how objects should move) and a static input image, then produces a video with realistic motion and interactions. The key innovation is a lightweight MaskAdapter that encodes the binary mask sequence into a latent residual signal, which is injected into a Multimodal Diffusion Transformer (MMDiT) model via a cosine-weighted schedule. Unlike UNet architectures, which have a hierarchical coarse-to-fine design, MMDiT stacks uniform transformer blocks, so it is hard to tell which layers handle motion. To solve this, the authors propose a method for identifying 'Motion Layers' within the attention space of MMDiT, then fine-tune only those layers using Low-Rank Adaptation (LoRA). This selective adaptation leaves the architecture unchanged and reduces computational cost.
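To make the injection step concrete, here is a minimal sketch of adding a residual under a cosine weighting. The summary does not say what the schedule is indexed by or what exact form it takes, so the half-cosine decay over a generic position index `i` (a block index or a denoising step) and the names `cosine_weight` and `inject_residual` are assumptions for illustration, not the authors' implementation.

```python
import math

def cosine_weight(i, n):
    """Assumed half-cosine weight: 1.0 at position 0, falling to 0.0 at position n - 1."""
    if n < 2:
        return 1.0
    return 0.5 * (1.0 + math.cos(math.pi * i / (n - 1)))

def inject_residual(hidden, residual, i, n):
    """Add the mask-derived residual to a hidden vector, scaled by the weight at position i."""
    w = cosine_weight(i, n)
    return [h + w * r for h, r in zip(hidden, residual)]

# Hypothetical usage: the same residual contributes less at later positions.
hidden = [1.0, 2.0]
residual = [1.0, 1.0]
early = inject_residual(hidden, residual, 0, 5)
late = inject_residual(hidden, residual, 4, 5)
```

Under this assumed form the mask signal is strongest at the first position and fades smoothly, which is one plausible reading of "cosine-weighted"; the actual schedule may differ.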
According to the authors, experiments across diverse datasets show CoMoGen consistently outperforming prior controllable video generation methods in both motion fidelity and perceptual realism. The framework enables precise control over subject motion and generates plausible interactions with surrounding humans, objects, and scenes. Its simplicity and efficiency make it a candidate for practical applications in content creation, simulation, and robotics, where generating realistic motion from minimal input is crucial.
- CoMoGen uses a lightweight MaskAdapter to encode binary mask sequences into latent residuals for motion injection into MMDiT.
- It identifies 'Motion Layers' in the attention space of MMDiT and fine-tunes them with LoRA, reducing computational cost without architecture changes.
- It reportedly achieves state-of-the-art motion fidelity and perceptual realism across multiple benchmarks compared to prior methods.
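The selective LoRA idea from the takeaways can be sketched as follows. LoRA replaces a frozen weight W with W + (alpha / r) * B A, where A and B are small rank-r matrices; only A and B are trained. The summary does not describe how Motion Layers are identified, so the layer names, the `motion_layers` set, and the helper functions here are hypothetical, and plain Python lists stand in for real tensors.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the rank (number of rows of A)."""
    r = len(a)
    delta = matmul(b, a)  # (out x r) @ (r x in) -> (out x in)
    scale = alpha / r
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

def adapt_layers(weights, motion_layers, adapters, alpha=1.0):
    """Apply the LoRA update only to layers named in motion_layers; others stay untouched."""
    return {
        name: lora_weight(w, adapters[name][0], adapters[name][1], alpha)
        if name in motion_layers else w
        for name, w in weights.items()
    }

# Hypothetical two-block model where only "block_1" is treated as a motion layer.
identity = [[1, 0], [0, 1]]
weights = {"block_0": identity, "block_1": identity}
adapters = {"block_1": ([[1, 2]], [[1], [0]])}  # (A: 1 x 2, B: 2 x 1), rank 1
adapted = adapt_layers(weights, {"block_1"}, adapters, alpha=2.0)
```

Because the update is a sum with the frozen weight, the layer shapes and the surrounding architecture are unchanged, and the trainable parameter count scales with the rank and the number of selected layers rather than with the full model.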
Why It Matters
CoMoGen enables realistic video generation from minimal input (one image plus a mask sequence), which is useful for content creation, simulation, and robotics.