GenEvolve: Self-evolving AI agents beat SOTA in image generation via tool orchestration
AI that learns from its own mistakes to generate better images without human retraining.
GenEvolve reframes open-ended image generation, moving beyond single-step prompt-to-image models. It treats each generation as a tool-orchestrated trajectory in which the agent gathers evidence, selects references, invokes skills, and composes a prompt-reference program.

The key innovation is to run multiple trajectories for the same request, compare them, and abstract the differences between the best and worst attempts into structured visual experience. This experience is fed only to a privileged teacher branch, which then supervises the student model with dense token-level feedback, a technique called Visual Experience Distillation. The result is an agent that keeps self-evolving without manual retraining, improving its search, knowledge activation, reference selection, and prompt construction over time.
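To make the best-worst comparison concrete, here is a minimal Python sketch. The `Trajectory` schema, the `score` field (assumed to come from some external quality judge), and the `best_worst_experience` helper are hypothetical illustrations, not GenEvolve's actual data format or API. Plain set differences stand in for the abstraction step, which the original method presumably performs with a model rather than with fixed rules.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    """One tool-orchestrated attempt at a single request (hypothetical schema)."""
    evidence: list[str]    # facts or search hits gathered by the agent
    references: list[str]  # IDs of the reference images it selected
    skills: list[str]      # names of the skills/tools it invoked
    prompt: str            # final prompt of the prompt-reference program
    score: float           # quality score, assumed to come from an external judge


def best_worst_experience(trajectories: list[Trajectory]) -> dict:
    """Contrast the best- and worst-scoring trajectories for one request.

    Set differences are a rule-based stand-in for the abstraction step;
    the real method likely summarizes differences with a model.
    """
    if len(trajectories) < 2:
        raise ValueError("need at least two trajectories to compare")
    best = max(trajectories, key=lambda t: t.score)
    worst = min(trajectories, key=lambda t: t.score)
    return {
        "score_gap": best.score - worst.score,
        # Choices the winner made that the loser did not, and vice versa.
        "helpful_references": sorted(set(best.references) - set(worst.references)),
        "harmful_references": sorted(set(worst.references) - set(best.references)),
        "helpful_skills": sorted(set(best.skills) - set(worst.skills)),
        "evidence_only_in_best": sorted(set(best.evidence) - set(worst.evidence)),
        "best_prompt": best.prompt,
    }


if __name__ == "__main__":
    runs = [
        Trajectory(["wiki:eiffel"], ["ref_a"], ["search", "crop"], "Eiffel Tower at dusk", 0.9),
        Trajectory([], ["ref_b"], ["search"], "tower at night", 0.3),
        Trajectory(["wiki:eiffel"], ["ref_a", "ref_b"], ["search"], "Eiffel Tower", 0.6),
    ]
    print(best_worst_experience(runs))
```

In this toy run the first and second attempts are contrasted, so the reference, skill, and evidence that only the winner used end up in the experience record. In the framework described above, a record like this would go only to the teacher branch.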
The framework comes with two new datasets: GenEvolve-Data for training and GenEvolve-Bench for evaluation. In experiments, GenEvolve delivers substantial gains over strong baselines and achieves state-of-the-art performance among current image-generation frameworks. By learning from its own generation attempts and tool use, without human-labeled data or static fine-tuning, GenEvolve moves closer to generalist image agents that adapt to increasingly diverse and demanding user requests.
- GenEvolve treats image generation as a tool-orchestrated trajectory, combining evidence gathering, reference selection, and prompt construction.
- Uses Visual Experience Distillation to provide dense token-level supervision from best-worst trajectory comparisons, enabling self-evolution.
- Achieves state-of-the-art performance on public benchmarks and the new GenEvolve-Bench, outpacing existing agentic generation methods.
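The dense token-level supervision in Visual Experience Distillation can be sketched as per-token distillation between two views of the same model: a teacher whose context includes the best-worst experience and a student whose context does not. The sketch below assumes per-position logits are already available from both branches and uses KL(teacher ‖ student) averaged over tokens. The divergence direction, the averaging, and all function names are assumptions for illustration, not the paper's stated objective. What it does show is the density of the signal: every generated token gets its own training target, rather than one scalar reward per trajectory.

```python
import math


def log_softmax(logits: list[float]) -> list[float]:
    """Numerically stable log-softmax over one vocabulary distribution."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_kl(teacher_logits: list[float], student_logits: list[float]) -> float:
    """KL(teacher || student) at a single token position."""
    lt = log_softmax(teacher_logits)
    ls = log_softmax(student_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lt, ls))


def distillation_loss(
    teacher_seq: list[list[float]], student_seq: list[list[float]]
) -> tuple[float, list[float]]:
    """Mean per-token KL over a response, plus the per-token values.

    teacher_seq: logits from the privileged teacher (context includes the
    best-worst experience). student_seq: logits from the student on the
    same tokens (context without it). Both are aligned per position.
    """
    if not teacher_seq or len(teacher_seq) != len(student_seq):
        raise ValueError("sequences must be non-empty and aligned")
    per_token = [token_kl(t, s) for t, s in zip(teacher_seq, student_seq)]
    return sum(per_token) / len(per_token), per_token
```

Positions where the teacher and student already agree contribute zero loss, so the gradient concentrates on exactly the tokens where the experience changes what the teacher would generate.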
Why It Matters
GenEvolve paves the way for AI image generators that improve autonomously, reducing the need for human retraining.