New study reveals flaws in AI tool-calling evaluations and speeds up RL training
How random seeds and system prompts secretly skew leaderboard rankings.
A new paper from Liu et al. (ICML 2026) dives into two critical but often overlooked aspects of tool-calling for LLM agents: effectiveness and efficiency. On the effectiveness side, the researchers systematically audit evaluation pipelines for tool-calling capabilities. They find that reported performance varies dramatically with seemingly minor implementation choices: the random seed, the system prompt wording, how multi-turn templates are constructed, and how prior reasoning history is carried forward. In multi-turn settings especially, these undocumented variations can flip leaderboard rankings, which makes current comparisons unreliable unless the evaluation setup is rigorously standardized.
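To make the ranking-flip problem concrete, here is a minimal sketch of how one might check whether a two-model comparison survives a change of seed. The model names and accuracy numbers are invented purely for illustration and are not taken from the paper; the helper names (rank_by_seed, summarize) are hypothetical as well.

```python
from statistics import mean, stdev

# Hypothetical per-seed accuracy for two models.
# Illustrative numbers only, NOT results from the paper.
scores = {
    "model_a": [0.62, 0.58, 0.65],
    "model_b": [0.60, 0.61, 0.59],
}

def rank_by_seed(scores):
    """Return the model ranking (best first) for each seed index."""
    n_seeds = len(next(iter(scores.values())))
    return [
        sorted(scores, key=lambda m: scores[m][i], reverse=True)
        for i in range(n_seeds)
    ]

def summarize(scores):
    """Return (mean, sample standard deviation) per model across seeds."""
    return {m: (mean(v), stdev(v)) for m, v in scores.items()}

rankings = rank_by_seed(scores)
# The ranking "flips" if at least two seeds disagree on the ordering.
ranking_flips = len({tuple(r) for r in rankings}) > 1

for model, (mu, sd) in summarize(scores).items():
    print(f"{model}: mean={mu:.3f} sd={sd:.3f}")
print("ranking flips across seeds:", ranking_flips)
```

Reporting the mean and spread over several seeds, rather than a single run, is one simple way to show when a leaderboard gap is smaller than the seed-to-seed noise.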
On efficiency, the team examines standard reinforcement learning (RL) for tool-calling training. They identify two major sources of computational waste: many rollout prompts produce zero learning signal, and policy updates are unnecessarily expensive. To address these, they introduce two acceleration techniques, one that prunes uninformative rollouts and one that streamlines optimization steps. The result is a substantial wall-clock speedup for RL-based tool-calling training with no degradation in final model performance, offering a practical path toward faster, cheaper agent development.
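The summary does not spell out how the rollout pruning works, so the following is only a sketch of one common reason a prompt yields zero learning signal. It assumes a group-relative advantage (each rollout's reward is normalized against the other rollouts for the same prompt, as in GRPO-style training); that assumption, and every function name below, is ours rather than the paper's. Under it, a prompt whose rollouts all receive the same reward produces all-zero advantages and can be dropped before the policy update.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's rollout group (assumed GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def has_signal(rewards):
    """A group carries signal only if its rewards are not all identical."""
    return max(rewards) != min(rewards)

def prune_zero_signal(batch):
    """Keep only the prompts whose rollout rewards differ."""
    return {prompt: r for prompt, r in batch.items() if has_signal(r)}

# Hypothetical batch: prompt id -> rewards of its sampled rollouts.
batch = {
    "prompt_1": [1.0, 1.0, 1.0, 1.0],  # always solved: zero advantage
    "prompt_2": [0.0, 0.0, 0.0, 0.0],  # never solved: zero advantage
    "prompt_3": [1.0, 0.0, 1.0, 0.0],  # mixed outcomes: useful signal
}

kept = prune_zero_signal(batch)
print("prompts kept for the update:", list(kept))
```

In this toy batch only prompt_3 would reach the policy update, so the other two groups cost rollout compute without contributing a gradient, which is the kind of waste the paper targets.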
- Evaluation results for tool-calling are highly sensitive to random seed, system prompt, multi-turn templates, and history handling.
- Standard RL for tool-calling wastes compute on zero-signal rollouts and costly policy updates.
- Two new techniques significantly cut RL training wall-clock time without losing performance.
Why It Matters
Flawed benchmarks waste time and money; faster RL training makes agentic AI more practical.