Research & Papers

MTA-Agent: An Open Recipe for Multimodal Deep Search Agents

A 32B open-source model averages 54.63% across six deep-search benchmarks, outperforming GPT-5 and Gemini-3-Pro.

Deep Dive

A research team led by Xiangyu Peng has introduced MTA-Agent, a groundbreaking open-source framework for creating multimodal AI agents capable of complex, evidence-based reasoning. The core innovation is a pipeline that automatically generates high-quality training data for multi-hop question answering, where an AI must perform several steps of search and verification across both visual and textual sources. Starting from existing visual question-answering datasets, the system synthesizes 21,000 verified examples, forming the MTA-Vision-DeepSearch dataset. The data is filtered through a multi-stage process that ensures factual consistency and unique answers; the surviving examples teach models to select appropriate tools, retrieve evidence, and validate their findings.
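The article doesn't spell out the individual filter stages, so the following is only a minimal Python sketch of how such a multi-stage filter could work; the `Example` fields, both checks, and the `solve` callback are illustrative assumptions, not the authors' implementation. One stage requires every reasoning hop to be backed by retrieved evidence; another re-solves the question to confirm the recorded answer is the unique one.

```python
from dataclasses import dataclass

@dataclass
class Example:
    question: str
    image_path: str
    hops: list[str]      # claims established at each search step (assumed field)
    evidence: list[str]  # snippets retrieved while building the example
    answer: str

def factually_consistent(ex: Example) -> bool:
    """Stage 1 (assumed): every hop claim must appear in at least
    one retrieved evidence snippet."""
    return all(
        any(claim.lower() in snippet.lower() for snippet in ex.evidence)
        for claim in ex.hops
    )

def unique_answer(ex: Example, solve, attempts: int = 3) -> bool:
    """Stage 2 (assumed): independently re-solving the question must
    converge on exactly the recorded answer."""
    candidates = {solve(ex.question, ex.image_path) for _ in range(attempts)}
    return candidates == {ex.answer}

def filter_examples(raw: list[Example], solve) -> list[Example]:
    """Keep only examples that pass both stages."""
    return [ex for ex in raw
            if factually_consistent(ex) and unique_answer(ex, solve)]
```

Under these assumptions an example is dropped as soon as either check fails, which is how aggressive filtering can distill a large synthetic pool down to 21,000 verified examples.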

When a 32-billion-parameter open-source multimodal model was trained on this dataset, it achieved a state-of-the-art average score of 54.63% across six challenging benchmarks, surpassing closed models like GPT-5 (51.86%) and Gemini-3-Pro (54.46%) under identical tool-use conditions. Crucially, training deepened the model's reasoning, raising the average number of reasoning steps from 2.27 to 4.28 and yielding more systematic, persistent search strategies. The team also demonstrated a cost-effective training method that uses cached tool interactions instead of live API calls.
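How cached-interaction training might work is easy to sketch: record each tool call's result once while generating trajectories, then replay results by key during training so no live API is ever hit. The `ToolCallCache` class below, including its hashing scheme, is a hypothetical illustration rather than the paper's implementation.

```python
import hashlib
import json

class ToolCallCache:
    """Replay-style cache (assumed design): tool results recorded during
    trajectory generation are served back during training."""

    def __init__(self):
        self._store: dict[str, str] = {}

    @staticmethod
    def _key(tool: str, args: dict) -> str:
        # Canonical JSON makes identical calls map to the same key.
        payload = json.dumps({"tool": tool, "args": args}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def record(self, tool: str, args: dict, result: str) -> None:
        self._store[self._key(tool, args)] = result

    def replay(self, tool: str, args: dict) -> str | None:
        # None signals the interaction was never recorded and would
        # require a live call.
        return self._store.get(self._key(tool, args))

# Record once during data generation...
cache = ToolCallCache()
cache.record("web_search", {"query": "Eiffel Tower height"}, "about 330 m")
# ...then replay during training with no network traffic.
assert cache.replay("web_search", {"query": "Eiffel Tower height"}) == "about 330 m"
```

Keying on the canonicalized call rather than on raw request strings is what makes replay deterministic across training runs.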

In a significant move for open AI research, the team has released the entire 'recipe'—including the full 21K dataset, detailed training trajectories, and implementation code—to foster reproducibility and future development. This positions MTA-Agent as a foundational resource for building capable, transparent multimodal search agents that can tackle real-world problems requiring deep visual and textual analysis.

Key Points
  • A 32B open-source model trained on the MTA-Vision-DeepSearch dataset scored 54.63%, outperforming GPT-5 (51.86%) and Gemini-3-Pro (54.46%) on six complex reasoning benchmarks.
  • The data pipeline generates 21,000 high-quality multi-hop examples that teach the model to select tools, retrieve evidence, and validate findings, raising average reasoning steps from 2.27 to 4.28 (see the sketch after this list).
  • The full dataset, code, and training details are publicly released as an 'open recipe,' enabling reproducible research and development of transparent multimodal agents.
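To make the step counts concrete, a multi-hop deep-search loop is typically structured as below. This is a generic sketch, not MTA-Agent's actual control flow; the `Action` type and the `policy` interface (`choose_action`, `is_consistent`) are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Action:
    name: str      # a tool to call, or "answer" to stop
    argument: str  # the tool's input, or the final answer text

def deep_search(question, image, tools, policy, max_steps=8):
    """Generic multi-hop loop: select a tool, retrieve evidence,
    validate it, and repeat until the policy decides to answer."""
    evidence = []
    for _ in range(max_steps):
        # Tool selection: the policy (e.g. the trained 32B model) picks
        # the next action from the question, image, and evidence so far.
        action = policy.choose_action(question, image, evidence)
        if action.name == "answer":
            return action.argument, evidence
        # Evidence retrieval: run the chosen tool (image search, web
        # search, OCR, ...) and record what it returned.
        result = tools[action.name](action.argument)
        evidence.append((action.name, action.argument, result))
        # Validation: discard the step if the new evidence contradicts
        # what has been gathered so far.
        if not policy.is_consistent(question, evidence):
            evidence.pop()
    return None, evidence  # no confident answer within the step budget
```

In this framing, the jump from 2.27 to 4.28 average steps simply means the trained model runs more iterations of this select-retrieve-validate cycle before committing to an answer.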

Why It Matters

MTA-Agent provides a proven, open-source blueprint for building AI that can perform complex, evidence-based reasoning across images and text, challenging closed models.