DIVE: Scaling Diversity in Agentic Task Synthesis for Generalizable Tool Use
New research shows diversity in training tasks beats quantity for AI tool-use generalization, even with 4x less data.
A research team from multiple institutions has introduced DIVE (Scaling Diversity in Agentic Task Synthesis for Generalizable Tool Use), a novel method that addresses the brittleness of AI agents when faced with new tools and tasks. The core innovation inverts the traditional synthesis process: instead of creating tasks and hoping agents can execute them, DIVE first executes diverse, real-world tools and then reverse-engineers training tasks strictly from the resulting evidence traces. This 'grounding by construction' approach ensures tasks are inherently executable and verifiable.
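To make the inversion concrete, here is a minimal Python sketch of the execute-then-derive idea. Everything in it is a hypothetical stand-in: the tool functions, `collect_evidence`, and the template-based `derive_task` are illustrations under assumed names, not DIVE's actual pipeline (which works over real tool traces and would generate task wording rather than fill a fixed template).

```python
import json

# Hypothetical tool executors standing in for real-world tools.
def get_weather(city: str) -> dict:
    return {"city": city, "temp_c": 21, "condition": "clear"}

def convert_currency(amount: float, src: str, dst: str) -> dict:
    return {"amount": round(amount * 0.92, 2), "src": src, "dst": dst}

TOOLS = {"get_weather": get_weather, "convert_currency": convert_currency}

def collect_evidence(calls: list[tuple[str, dict]]) -> list[dict]:
    """Step 1: execute the tools FIRST and record an evidence trace."""
    return [{"tool": name, "args": args, "result": TOOLS[name](**args)}
            for name, args in calls]

def derive_task(trace: list[dict]) -> dict:
    """Step 2: reverse-derive a task strictly from the observed trace, so it
    is executable and verifiable by construction. A fixed template is used
    here purely for illustration."""
    q = (f"What is the weather in {trace[0]['args']['city']}, and how much is "
         f"{trace[1]['args']['amount']} {trace[1]['args']['src']} "
         f"in {trace[1]['args']['dst']}?")
    return {"task": q, "gold_trace": trace}

trace = collect_evidence([
    ("get_weather", {"city": "Paris"}),
    ("convert_currency", {"amount": 100.0, "src": "USD", "dst": "EUR"}),
])
print(json.dumps(derive_task(trace), indent=2))
```

Because the trace exists before the task does, every synthesized task ships with a gold execution record against which agent behavior can be verified.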
The method scales diversity along two controllable axes: tool-pool coverage (373 tools across five domains) and per-task toolset variety; a toy sketch of both knobs follows this paragraph. An Evidence Collection–Task Derivation loop further induces rich, multi-step tool-use patterns. When the researchers trained Qwen3-8B on DIVE-generated data (48k SFT + 3.2k RL examples), the agent generalized strongly, improving by an average of +22 points across 9 out-of-distribution (OOD) benchmarks and outperforming the strongest 8B baseline by +68 points.
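Here is that sketch. A toy pool stands in for the 373 real tools, and both functions are hypothetical illustrations of the two axes, not DIVE's API:

```python
import itertools
import random

rng = random.Random(0)

# Toy pool standing in for the paper's 373 tools across five domains.
DOMAINS = ["travel", "finance", "health", "media", "productivity"]
TOOL_POOL = [f"{d}/tool_{i}" for d in DOMAINS for i in range(5)]

def sample_toolset(min_k: int = 2, max_k: int = 6) -> list[str]:
    """Axis 2: vary each task's toolset size and composition."""
    return rng.sample(TOOL_POOL, rng.randint(min_k, max_k))

def pool_coverage(tasks: list[dict]) -> float:
    """Axis 1: fraction of the tool pool exercised by a synthesized batch."""
    used = set(itertools.chain.from_iterable(t["toolset"] for t in tasks))
    return len(used) / len(TOOL_POOL)

batch = [{"toolset": sample_toolset()} for _ in range(40)]
print(f"tool-pool coverage: {pool_coverage(batch):.0%}")
```

Treating coverage and per-task variety as independent knobs is what lets the authors scale each axis separately and measure its effect in isolation.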
A key finding is that scaling diversity consistently outperforms scaling quantity for OOD generalization. In controlled comparisons, models trained on smaller but more diverse DIVE datasets beat models trained on less varied datasets containing four times as many examples. This challenges the conventional wisdom that simply adding more data is the primary path to better generalization, and it highlights the importance of structural and compositional diversity in training tasks for robust AI agents. A toy sketch of such a comparison protocol follows.
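The sketch below is an illustrative protocol, not the paper's exact setup: one small training set drawn from the whole tool pool is compared against a set four times larger but confined to a narrow slice of tools, with distinct toolsets per example as a crude proxy for structural diversity.

```python
import random

rng = random.Random(0)
POOL = [f"tool_{i:02d}" for i in range(50)]

def synth(n: int, tool_slice: list[str]) -> list[dict]:
    """Make n toy tasks whose toolsets are drawn only from `tool_slice`."""
    return [{"toolset": tuple(sorted(rng.sample(tool_slice, rng.randint(2, 4))))}
            for _ in range(n)]

budget = 1000
diverse = synth(budget, POOL)         # smaller set, whole tool pool
narrow = synth(4 * budget, POOL[:5])  # 4x the data, 10% of the tools

for name, ds in (("diverse", diverse), ("narrow", narrow)):
    variety = len({t["toolset"] for t in ds}) / len(ds)
    print(f"{name:8s} size={len(ds):5d} distinct-toolset ratio={variety:.3f}")
```

Running this, the small diverse set shows a near-1.0 distinct-toolset ratio while the 4x-larger narrow set collapses to a handful of repeated compositions, which is the kind of structural gap the paper's controlled analysis credits for the OOD difference.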
- DIVE's evidence-driven synthesis executes 373 real-world tools first, then reverse-derives training tasks from the traces for guaranteed grounding.
- Training Qwen3-8B on DIVE data (48k SFT + 3.2k RL examples) boosted performance by an average of +22 points across 9 OOD benchmarks, beating the strongest 8B baseline by +68 points.
- Controlled experiments show diversity scaling beats quantity scaling for generalization, with superior results even using 4x less data than larger, less diverse sets.
Why It Matters
DIVE offers a scalable blueprint for building more robust, generalizable AI agents that can reliably use new tools in unpredictable real-world environments.