Research & Papers

Enhancing Tool Calling in LLMs with the International Tool Calling Dataset

New benchmark with 17,540 tasks across 40 countries exposes major gaps in AI's ability to use tools globally.

Deep Dive

Researchers Zuoyu Zhang and Yancheng Zhu have published a new paper introducing the International Tool Calling (ITC) dataset, a large-scale benchmark designed to rigorously evaluate how large language models (LLMs) interact with external tools and APIs. The work addresses critical shortcomings in existing benchmarks, which often rely on simulated APIs, lack reproducibility, and fail to represent global diversity. The ITC dataset is substantial, comprising 3,571 real-world APIs and 17,540 distinct tool-calling tasks. These tasks are organized into 20 functional categories and span 40 countries, ensuring cultural and linguistic variety.
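To make the dataset's structure concrete, a tool-calling task typically pairs a user query (possibly non-English) with a set of available tool schemas and a gold-standard call. The sketch below is purely illustrative: the paper's actual record schema is not shown in this summary, so every field name and value here is a hypothetical assumption.

```python
import json

# Hypothetical sketch of what an ITC-style task record might contain.
# The real schema is not specified in this summary; all fields are illustrative.
task = {
    "task_id": "itc-000123",        # illustrative identifier
    "country": "JP",                # one of the 40 represented countries
    "category": "travel",           # one of the 20 functional categories
    "query": "東京で明日の天気は？",  # non-English query: "What's the weather in Tokyo tomorrow?"
    "tools": [
        {
            # A tool schema the model may choose from
            "name": "get_weather_forecast",
            "parameters": {"city": "string", "date": "string"},
        }
    ],
    "expected_call": {
        # Gold-standard call the model is expected to produce
        "name": "get_weather_forecast",
        "arguments": {"city": "Tokyo", "date": "2024-06-02"},
    },
}

# Serialize the gold call, preserving non-ASCII text elsewhere in the record
print(json.dumps(task["expected_call"], ensure_ascii=False))
```

A record like this lets a benchmark check both tool selection (did the model pick the right API?) and argument filling (did it extract the right values from a non-English query?).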

Experiments conducted with the ITC dataset revealed a substantial performance gap between leading closed-source models (like GPT-4) and open-source alternatives in reliably executing tool calls. More importantly, the research demonstrates that fine-tuning LLMs specifically on the ITC data yields significant improvements, particularly for handling non-English queries and unfamiliar tools. This enhancement leads to better cross-lingual generalization, more consistent reasoning, and increased robustness when models encounter out-of-domain tools—a common challenge in real-world deployments.
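Measuring the performance gaps described above requires a scoring rule for tool calls. The paper's exact metric is not detailed in this summary; a common baseline, sketched here as an assumption, is exact matching of the predicted tool name and argument values against the gold call.

```python
def tool_call_matches(pred: dict, gold: dict) -> bool:
    """Exact-match check: same tool name and identical argument values.

    A simple illustrative metric, not necessarily the one used in the paper.
    """
    return (
        pred.get("name") == gold.get("name")
        and pred.get("arguments") == gold.get("arguments")
    )

gold = {"name": "get_weather_forecast",
        "arguments": {"city": "Tokyo", "date": "2024-06-02"}}
pred_ok = {"name": "get_weather_forecast",
           "arguments": {"date": "2024-06-02", "city": "Tokyo"}}
pred_bad = {"name": "get_weather_forecast",
            "arguments": {"city": "Osaka", "date": "2024-06-02"}}

print(tool_call_matches(pred_ok, gold))   # True: dict comparison ignores key order
print(tool_call_matches(pred_bad, gold))  # False: wrong argument value
```

Aggregating such per-task checks across categories, countries, and in- vs. out-of-domain tool splits is what exposes the cross-lingual and robustness gaps the experiments report.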

The creation of ITC represents a major step forward for the AI research community, providing a standardized, realistic, and globally aware benchmark. It moves beyond abstract reasoning tests to measure how well AI agents can perform practical actions in complex, multi-tool environments that reflect actual international use cases. This dataset is poised to become a key resource for developers aiming to build more reliable and capable AI agents for applications in customer support, data analysis, and dynamic content generation across different regions and languages.

Key Points
  • Dataset includes 3,571 real APIs and 17,540 tasks across 40 countries and 20 categories.
  • Reveals major performance gaps between open- and closed-source LLMs in real tool-calling scenarios.
  • Fine-tuning on ITC data significantly improves model performance, especially for non-English queries and unfamiliar tools.

Why It Matters

Provides the first major benchmark for building AI agents that can reliably use tools and APIs in a globally diverse, real-world context.