ToolWeave boosts LLM agent training with 45% more multi-step tool calls
New framework cuts parameter hallucinations and beats the prior state of the art by about 16 points on the BFCL-V3 multi-turn benchmark.
IBM Research has released ToolWeave, a structured framework for synthesizing high-quality, multi-turn tool-calling dialogues. Current synthetic data pipelines often fail because they chain tools that are only superficially compatible, or generate dialogues in one shot, leading to unrealistic arguments and an underrepresentation of multi-step interactions. ToolWeave addresses these issues by constructing tools with built-in dependencies and filtering workflows based on user goal alignment. It uses a fine-grained planning stage that explicitly tracks parameter provenance, drastically reducing parameter hallucinations.
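To make the idea of tracking parameter provenance concrete, here is a minimal sketch of the kind of check a planning stage could run. It is not ToolWeave's implementation: the `PlannedCall` structure, the `"user"`/`"tool"` source labels, the `find_unsupported_args` helper, and the example tool names are all hypothetical and chosen only for illustration. The assumption is that every argument in a planned tool call must point either to a user turn or to an output produced by an earlier call. Anything else is flagged as a likely hallucination.

```python
from dataclasses import dataclass, field

# Hypothetical provenance labels; the article does not name ToolWeave's own categories.
USER = "user"
TOOL = "tool"


@dataclass
class PlannedCall:
    tool: str
    # Maps each argument name to (source kind, reference):
    #   ("user", turn_index) or ("tool", "<tool_name>.<output_field>")
    args: dict
    # Names of the fields this call is expected to return.
    outputs: list = field(default_factory=list)


def find_unsupported_args(plan, num_user_turns):
    """Return (tool, arg) pairs whose claimed source is not available at that step."""
    available = set()  # outputs produced by earlier calls in the plan
    problems = []
    for call in plan:
        for arg, (kind, ref) in call.args.items():
            if kind == USER:
                ok = isinstance(ref, int) and 0 <= ref < num_user_turns
            elif kind == TOOL:
                ok = ref in available
            else:
                ok = False
            if not ok:
                problems.append((call.tool, arg))
        # Only after the call is checked do its outputs become usable downstream.
        available.update(f"{call.tool}.{name}" for name in call.outputs)
    return problems


# Illustrative two-step plan with made-up tools.
plan = [
    PlannedCall("search_flights", {"destination": (USER, 0)}, outputs=["flight_id"]),
    PlannedCall(
        "book_flight",
        {
            "flight_id": (TOOL, "search_flights.flight_id"),
            # No earlier call produces this value, so it should be flagged.
            "seat_class": (TOOL, "get_profile.seat_class"),
        },
    ),
]

issues = find_unsupported_args(plan, num_user_turns=1)
print(issues)
```

In this sketch the `seat_class` argument is flagged because no earlier step in the plan produces it, while `flight_id` passes because it traces back to the output of `search_flights`.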
Results show that ToolWeave-generated dialogues contain 45% more multi-step tool interactions than those from prior pipelines, with fewer hallucinated parameters and tool names. Fine-tuned on this data, Llama-3.1-70B scores 39.75% on the BFCL-V3 multi-turn benchmark, up 16.25 points from the 23.50% it reaches when trained on data from ToolFlow, the previous state of the art. Training data of this kind should support more reliable autonomous agents that can carry out complex, realistic user requests across multiple tools.
- ToolWeave generates 45% more multi-step tool interactions than prior synthetic data pipelines.
- Llama-3.1-70B fine-tuned on ToolWeave scores 39.75% on BFCL-V3 multi-turn vs. 23.50% with ToolFlow.
- Fine-grained planning explicitly tracks parameter provenance to reduce hallucinations in tool arguments.
Why It Matters
High-quality synthetic data is the bottleneck for turning LLMs into reliable autonomous agents, and ToolWeave is built to clear it.