Image & Video

Gen-Searcher: Search-augmented agent for image generation (8B model and SFT model on Hugging Face)

This 8B-parameter multimodal agent gathers up-to-date facts and reference images from the web before generating, improving KnowGen benchmark scores by about 16 points when paired with the Qwen-Image generator.

Deep Dive

A research team from the Chinese University of Hong Kong (CUHK), UC Berkeley, and UCLA has introduced Gen-Searcher, a multimodal AI agent designed to address a fundamental problem in image generation: outdated or missing knowledge in model parameters. Standard text-to-image models like Stable Diffusion or DALL-E rely solely on their parametric memory, which can be incomplete or stale. Gen-Searcher acts as a front-end agent: it first performs multi-hop web searches to gather current textual facts, retrieves relevant reference images, and then synthesizes this information into a grounded prompt for an image generator. This search-augmented generation (SAG) approach lets it handle queries that require recent events, specific visual details, or niche knowledge that base models lack.
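The search-then-generate flow described above can be illustrated with a minimal Python sketch. It is not Gen-Searcher's implementation. The function `build_grounded_prompt`, the `Evidence` container, the follow-up rule that reuses the latest fact as the next query, and the stub search functions are all hypothetical names and assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Evidence:
    """Facts and reference images collected before generation (hypothetical container)."""
    facts: List[str] = field(default_factory=list)
    image_urls: List[str] = field(default_factory=list)


def build_grounded_prompt(
    query: str,
    text_search: Callable[[str], List[str]],
    image_search: Callable[[str], List[str]],
    max_hops: int = 3,
) -> Tuple[str, Evidence]:
    """Run up to max_hops text searches, fetch reference images, and build a prompt."""
    evidence = Evidence()
    next_query = query
    for _ in range(max_hops):
        facts = text_search(next_query)
        if not facts:
            break  # nothing new was found, so stop searching
        evidence.facts.extend(facts)
        # Assumed follow-up rule: the latest fact becomes the next search query.
        next_query = facts[-1]
    evidence.image_urls = image_search(query)
    prompt = query
    if evidence.facts:
        prompt += "\nKnown facts: " + "; ".join(evidence.facts)
    return prompt, evidence


# Stub search back ends with made-up data, standing in for real web search.
def fake_text_search(q: str) -> List[str]:
    toy_index = {"team mascot": ["the mascot is a blue fox"]}
    return toy_index.get(q, [])


def fake_image_search(q: str) -> List[str]:
    return ["ref.png"]


prompt, evidence = build_grounded_prompt("team mascot", fake_text_search, fake_image_search)
print(prompt)
```

In a real agent the follow-up queries would be chosen by the model itself rather than by a fixed rule, and the grounded prompt and reference images would then be passed to an image generator.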

The team developed a dedicated data pipeline to create two training datasets: Gen-Searcher-SFT-10k and Gen-Searcher-RL-6k. They trained the 8B-parameter model first with supervised fine-tuning, followed by agentic reinforcement learning using both text-based and image-based rewards. To evaluate its capability, they introduced KnowGen, a new benchmark specifically for search-dependent image generation. When paired with the Qwen-Image generator, Gen-Searcher improved performance by approximately 16 points on KnowGen and 15 points on the existing WISE benchmark. Crucially, the approach demonstrates transferability, meaning the agent's search-and-grounding logic can enhance other image generators beyond the one it was trained with. The entire project, including the model weights, is fully open-sourced on Hugging Face.
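The summary above does not say how the text-based and image-based rewards are combined during reinforcement learning. One common way to mix two reward signals is a weighted sum, sketched below. The function name `combined_reward`, the `text_weight` parameter, and its default of 0.5 are assumptions for illustration, not details from the project.

```python
def combined_reward(text_reward: float, image_reward: float, text_weight: float = 0.5) -> float:
    """Blend two reward signals with a weighted sum (an assumed scheme, not the paper's)."""
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must be between 0 and 1")
    # The image reward receives the remaining weight so the two weights sum to 1.
    return text_weight * text_reward + (1.0 - text_weight) * image_reward


# Example: a perfect text reward and a middling image reward, equally weighted.
reward = combined_reward(1.0, 0.5)
print(reward)
```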

Key Points
  • Gen-Searcher is an 8B-parameter multimodal agent that performs web search and image retrieval before generating images, overcoming parametric memory limits.
  • It was trained on custom datasets using supervised fine-tuning and RL, boosting Qwen-Image's score by ~16 points on the new KnowGen benchmark.
  • The project is fully open-source, and the search-augmented approach is transferable to other image generators beyond the one used in training.

Why It Matters

It lets AI image generators handle current events and niche knowledge, moving beyond the static training data of today's models.