Image & Video

DrawVideo turns storyboard sketches into controllable long videos

Sketch a few keyframes, and AI generates a full narrative video.

Deep Dive

Researchers from the University of Sydney and multiple Chinese institutions (Chuanzhi Xu, Huiqi Liang, Bang Shi, Huiming Zhang, Yifan Xiao, Guangcheng Lin, Haodong Chen, Qiang Qu, Zhicheng Lu, Weidong Cai) have introduced DrawVideo, a framework that generates long, coherent videos from storyboard sketches. Existing text-to-video methods typically rely on a single lengthy prompt, which offers little control over pose, composition, or motion. DrawVideo instead breaks the video into independently controllable shots. Each shot is defined by three components: a black-and-white sketch (controlling pose and layout), an appearance prompt (defining identity, scene, and style), and a motion prompt (guiding temporal dynamics). The system follows a hierarchical 'global multi-shot, local single-sketch' pipeline with three steps per shot: it first generates a structure-aligned reference keyframe from the sketch, then expands the motion prompt into derivative keyframes representing successive action states, and finally synthesizes clips between adjacent keyframes to build the shot.
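The per-shot structure and the three-step flow described above can be sketched in a few lines of Python. This is an illustrative outline only, not the authors' code: the names ShotSpec, expand_motion, plan_shot, and plan_video are hypothetical, and the motion expansion is a trivial string split standing in for whatever model the paper actually uses. It only plans which keyframes and clips would exist; it generates no images or video.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ShotSpec:
    """One independently controllable shot (field names are hypothetical)."""
    sketch_path: str        # black-and-white sketch: pose and layout
    appearance_prompt: str  # identity, scene, and style
    motion_prompt: str      # temporal dynamics


def expand_motion(motion_prompt: str) -> List[str]:
    """Placeholder: split a motion prompt into ordered action states.

    Assumption: a real system would use a learned model here. This stub
    splits on ' then ' so the control flow can be run end to end.
    """
    return [part.strip() for part in motion_prompt.split(" then ") if part.strip()]


def plan_shot(shot: ShotSpec) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Return (keyframe labels, adjacent keyframe pairs to fill with clips)."""
    # Step 1: one structure-aligned reference keyframe for the sketch.
    keyframes = [f"reference[{shot.sketch_path}]"]
    # Step 2: derivative keyframes, one per action state.
    keyframes += [f"derived[{state}]" for state in expand_motion(shot.motion_prompt)]
    # Step 3: one clip is synthesized between each adjacent keyframe pair.
    clips = list(zip(keyframes, keyframes[1:]))
    return keyframes, clips


def plan_video(shots: List[ShotSpec]) -> List[Tuple[str, str]]:
    """Global multi-shot level: concatenate per-shot clip plans in order."""
    plan: List[Tuple[str, str]] = []
    for shot in shots:
        _, clips = plan_shot(shot)
        plan.extend(clips)
    return plan
```

For example, a shot whose motion prompt is "crouches then leaps then lands" yields one reference keyframe plus three derivative keyframes, and therefore three clips to synthesize. Because each shot carries its own sketch and prompts, editing one ShotSpec leaves the plans for the other shots unchanged, which is the controllability the article describes.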

To support this paradigm, the team built SketchLongVideo, which they describe as the first dataset specifically for sketch-guided text-to-long-video generation. It was constructed from animation videos through five stages: shot detection, keyframe extraction, vision-language recognition, prompt decomposition, and sketch conversion. According to the authors, experiments show that DrawVideo achieves strong structural controllability, high appearance consistency, visual stability, and coherent long-video generation. These properties matter to creative professionals who need to storyboard and iterate on video narratives. The paper (arXiv:2605.23508) spans 45 pages with 19 figures, covering the full methodology and evaluation.
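To make the first two dataset stages concrete, here is a toy sketch of shot detection and keyframe extraction. It is not the procedure used to build SketchLongVideo, which the summary does not detail. It substitutes a common heuristic (thresholding the mean absolute difference between consecutive frames), and the function names, the threshold value, and the choice of the middle frame as keyframe are all assumptions made for illustration. The later stages (vision-language recognition, prompt decomposition, sketch conversion) are omitted.

```python
from typing import List, Sequence

Frame = Sequence[float]  # toy frame: a flat list of grayscale values in [0, 1]


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Average per-pixel absolute difference between two equal-sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def detect_shots(frames: List[Frame], threshold: float = 0.5) -> List[range]:
    """Split frame indices into shots wherever consecutive frames differ sharply.

    Assumption: a hard cut shows up as a large frame-to-frame difference.
    """
    if not frames:
        return []
    starts = [0]
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i - 1], frames[i]) > threshold:
            starts.append(i)  # frame i opens a new shot
    ends = starts[1:] + [len(frames)]
    return [range(s, e) for s, e in zip(starts, ends)]


def middle_keyframe(shot: range) -> int:
    """Pick the middle frame index of a shot as its keyframe (an assumption)."""
    return shot[len(shot) // 2]
```

On five toy frames where the third frame jumps from dark to bright, detect_shots returns two shots, and middle_keyframe picks one representative frame index from each. A production pipeline would more likely use a dedicated scene-detection tool and handle gradual transitions, which this heuristic does not.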

Key Points
  • DrawVideo decomposes long videos into shots, each controlled by a sketch, an appearance prompt, and a motion prompt.
  • Uses a hierarchical strategy: generate reference keyframe → expand motion into derivative keyframes → synthesize clips between them.
  • Introduces SketchLongVideo, the first dataset for sketch-guided text-to-long-video generation, built from animation videos via shot detection and keyframe extraction.

Why It Matters

Artists and filmmakers could storyboard complex videos with precise control over pose and motion, rather than relying on text prompts alone.