InsEdit: Towards Instruction-based Visual Editing via Data-Efficient Video Diffusion Models Adaptation
New model edits videos starting mid-clip using only ~100K training samples and outperforms other open-source methods on editing benchmarks.
A research team from Tencent and the Hong Kong University of Science and Technology has introduced InsEdit, a new AI model that transforms video generation backbones into powerful video editors using minimal training data. Built on Tencent's HunyuanVideo-1.5 foundation, InsEdit addresses the critical data scarcity problem in video editing by requiring only approximately 100,000 video editing samples, orders of magnitude fewer than typical approaches. The breakthrough comes from its Mutual Context Attention (MCA) architecture, which creates precisely aligned video pairs in which edits can begin at any frame rather than only the first, enabling more natural and flexible editing workflows.
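The article does not reproduce the paper's exact attention formulation, but a minimal sketch can make the idea concrete. Assuming MCA amounts to joint self-attention over the concatenated token sequences of the source clip and the clip being edited, so that context flows both ways between the aligned pair, a sketch might look like this (the class name, dimensions, and residual wiring are illustrative assumptions, not the authors' implementation):

```python
import torch
import torch.nn as nn

class MutualContextAttention(nn.Module):
    """Hypothetical sketch of Mutual Context Attention (MCA).

    Assumption: MCA lets edited-clip tokens attend over the
    concatenation of edited-clip and source-clip tokens (and vice
    versa), so context flows mutually between the aligned video pair.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, edit_tokens: torch.Tensor, src_tokens: torch.Tensor):
        # Joint sequence: each stream can read context from the other.
        joint = torch.cat([edit_tokens, src_tokens], dim=1)
        q = self.norm(joint)
        out, _ = self.attn(q, q, q)  # self-attention over the joint sequence
        joint = joint + out          # residual connection
        n = edit_tokens.shape[1]
        return joint[:, :n], joint[:, n:]  # split back into the two streams

# Toy usage: batch of 2 clips, 16 tokens each, 64-dim features.
edit = torch.randn(2, 16, 64)
src = torch.randn(2, 16, 64)
mca = MutualContextAttention(dim=64)
edited_out, source_out = mca(edit, src)
print(edited_out.shape, source_out.shape)  # torch.Size([2, 16, 64]) each
```

Because every edited-clip token sees every source-clip token in the joint attention, conditioning is not tied to the first frame, which is consistent with the mid-clip editing capability described above.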
This data-efficient adaptation lets InsEdit achieve state-of-the-art performance among open-source models on instruction-based video editing benchmarks. The training recipe strategically mixes image editing data with the video samples, giving the final model the added ability to perform high-quality image editing with no architectural changes or additional training. This dual functionality makes InsEdit a versatile tool for content creators who need consistent editing across both mediums, and it could lower the barrier to professional-grade video manipulation through simple text instructions.
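As a rough illustration of how such a mixed recipe might be wired, here is a hypothetical sampler that interleaves image-editing pairs with video-editing pairs. The mixing probability and the treatment of images as single-frame clips are assumptions for illustration, not details from the paper:

```python
import random

def sample_training_pair(video_pairs, image_pairs, image_prob=0.3):
    """Draw one editing example from a mixed image/video pool.

    Hypothetical sketch: the article says image editing data is mixed
    into training but not how. `image_prob` and the single-frame-clip
    treatment of images are illustrative assumptions.
    """
    if image_pairs and random.random() < image_prob:
        src_img, tgt_img, instruction = random.choice(image_pairs)
        # Treat the image pair as a one-frame video pair so both data
        # types flow through the same video-editing training path.
        return [src_img], [tgt_img], instruction
    return random.choice(video_pairs)

# Toy usage with placeholder data.
videos = [(["f0", "f1", "f2"], ["g0", "g1", "g2"], "make the sky stormy")]
images = [("cat.png", "cat_hat.png", "add a hat to the cat")]
src, tgt, prompt = sample_training_pair(videos, images)
print(len(src), prompt)
```

Routing images through the same pipeline as one-frame clips would explain why the final model handles image editing without any architectural changes.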
- Built on Tencent's HunyuanVideo-1.5 and requires only ~100K video samples for training, easing the data-scarcity problem
- Uses Mutual Context Attention (MCA) to enable edits starting mid-clip, not just from the first frame
- Achieves SOTA among open-source methods on video editing benchmarks and supports image editing without modification
Why It Matters
Dramatically reduces the data and cost barrier for creating professional AI video editors, enabling more accessible content creation tools.