Teaching an Agent to Sketch One Part at a Time
A novel multi-modal agent builds sketches step-by-step using a new dataset and process-reward reinforcement learning.
A team of researchers has published a paper titled "Teaching an Agent to Sketch One Part at a Time," introducing a new method for generating vector graphics. The core innovation is a multi-modal language-model-based agent trained in two stages: supervised fine-tuning followed by multi-turn process-reward reinforcement learning. This regimen teaches the agent to construct a sketch incrementally, adding one semantic part (such as a wheel, a window, or a tree branch) per step, rather than generating a complete, static image in one pass.
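The paper's code is not reproduced here, but the multi-turn loop can be pictured with a minimal Python sketch. Everything below is illustrative: `agent.propose_part`, `render`, and `part_reward` are hypothetical stand-ins for the model, the rasterizer that supplies visual feedback, and the per-step process-reward signal, not the authors' actual API.

```python
# Hypothetical sketch of multi-turn generation with process rewards.
# The agent adds one semantic part per step and is scored on each
# intermediate state, not only on the finished image.

def generate_sketch(agent, prompt, max_parts=10):
    paths = []       # accumulated vector paths (e.g. SVG path strings)
    trajectory = []  # (observation, action, reward) tuples for RL training

    for _ in range(max_parts):
        canvas = render(paths)  # rasterize current sketch as visual feedback
        action = agent.propose_part(prompt, canvas, paths)
        if action is None:      # agent judges the sketch complete
            break
        paths.extend(action["paths"])
        # Process reward: score this step's contribution immediately.
        reward = part_reward(prompt, render(paths), action["label"])
        trajectory.append((canvas, action, reward))

    return paths, trajectory
```

The key design point this captures is that reward is attached to each part-adding step, giving the policy denser feedback than a single score on the final sketch.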
This approach is made possible by a new, richly annotated dataset called ControlSketch-Part. The team developed a generic automatic annotation pipeline that segments existing vector sketches into their constituent semantic parts and labels the vector paths belonging to each part. Because the agent receives this structured part-level data together with visual feedback throughout generation, the system offers a high degree of interpretability and user control. The final output is not a rasterized image but a fully editable vector graphic, so users can go back and tweak individual components, such as the shape of a roof or the size of a car door, long after the initial generation.
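The paper does not publish a schema here, but a part-annotated record in the ControlSketch-Part style could plausibly look like the following. The field names and path data are assumptions for illustration only; the essential idea is that each semantic part is bound to the specific vector paths that draw it.

```python
# Illustrative (not official) record: one sketch, its caption, and a
# per-part mapping from semantic label to the vector paths for that part.
example_record = {
    "caption": "a car",
    "parts": [
        {
            "label": "wheel",
            "path_ids": [0],  # indices into the sketch's full path list
            "paths": ["M 10 80 A 12 12 0 1 0 34 80 A 12 12 0 1 0 10 80"],
        },
        {
            "label": "door",
            "path_ids": [1],
            "paths": ["M 40 30 L 70 30 L 70 70 L 40 70 Z"],
        },
    ],
}
```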
- Uses a novel multi-turn process-reward reinforcement learning method to train a multi-modal agent for incremental sketch generation.
- Introduces the ControlSketch-Part dataset, created via an automatic pipeline that segments and labels vector sketches into semantic parts.
- Enables locally editable text-to-vector sketch generation, allowing post-creation modification of individual sketch components (a minimal editing sketch follows this list).
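Because each semantic part maps to its own vector paths, local edits reduce to transforming only that part's paths. The helper below is a hypothetical illustration built on the assumed record format above; `svgpathtools` is a real third-party library for SVG path math, but `move_part` itself is not from the paper.

```python
from svgpathtools import parse_path

def move_part(record, label, dx, dy):
    """Translate every path of one semantic part; all other parts stay put."""
    for part in record["parts"]:
        if part["label"] == label:
            part["paths"] = [
                parse_path(d).translated(complex(dx, dy)).d()
                for d in part["paths"]
            ]
    return record

# Usage: nudge only the car door 5 units to the right.
# move_part(example_record, "door", dx=5, dy=0)
```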
Why It Matters
This shifts AI image generation from static outputs to editable, structured assets, which is directly useful for designers and illustrators.