Bits-over-Random metric reduces tools shown to LLM agents by 86%

arXiv cs.IR May 26, 2026

⚡ A new chance-corrected metric and RL agent cut the tools shown from 50 to an average of 7 with almost no loss in coverage.

Deep Dive

A new research paper from Vyzantinos Repantis and colleagues tackles a fundamental question for LLM agents: how many candidate tools should be shown per query? Too many tools confuse the model, while too few risk missing the correct one. Most systems use a fixed shortlist size, but the authors argue that no standard metric exists to evaluate whether that size was appropriate for a given query. They introduce Bits-over-Random (BoR), a chance-corrected metric that measures whether success at a given depth is better than random selection at that same depth. This metric is then repurposed as a reinforcement learning (RL) reward to train an agent that adaptively selects the shortlist depth per query.
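To make the idea of chance correction concrete, here is a minimal sketch in Python. The summary above does not give the paper's formula, so everything here is an assumption made for illustration: BoR is taken to be the base-2 log of the observed success rate divided by the success rate of a uniformly random shortlist of the same depth, and the function names `random_hit_probability` and `bits_over_random` are hypothetical. The authors' actual definition may differ.

```python
import math


def random_hit_probability(depth: int, num_tools: int, num_relevant: int = 1) -> float:
    """Chance that a uniformly random shortlist of `depth` tools (drawn without
    replacement from `num_tools`) contains at least one relevant tool."""
    if depth < 1 or num_tools < 1 or not (1 <= num_relevant <= num_tools):
        raise ValueError("depth, num_tools and num_relevant must be positive and consistent")
    depth = min(depth, num_tools)
    # Hypergeometric miss probability: C(N - R, k) / C(N, k).
    miss = math.comb(num_tools - num_relevant, depth) / math.comb(num_tools, depth)
    return 1.0 - miss


def bits_over_random(success_rate: float, depth: int, num_tools: int,
                     num_relevant: int = 1) -> float:
    """ASSUMED form of a chance-corrected score: log2(observed / random baseline).
    0 bits means no better than chance; higher means better than chance."""
    if success_rate <= 0.0:
        return float("-inf")
    baseline = random_hit_probability(depth, num_tools, num_relevant)
    return math.log2(success_rate / baseline)


# Illustrative use with the BFCL figures quoted in this article (370 tools).
# The depth of 7 is an average, so treating it as a fixed depth is a simplification.
print(bits_over_random(0.908, depth=50, num_tools=370))
print(bits_over_random(0.903, depth=7, num_tools=370))
```

Under this assumed form, a deeper shortlist has to clear a higher random baseline, so near-identical coverage at a much smaller depth scores higher. That is the intuition behind rewarding an agent for choosing the shallowest depth that still finds the right tool.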

The results are compelling. On BFCL with 370 tools, the learned policy nearly matches the coverage of a fixed 50-tool shortlist (90.3% vs. 90.8%) while presenting only 7 tools on average, an 86% reduction in tools shown. On ToolBench with 3,251 tools, a fixed shortlist of 5 tools achieves higher aggregate coverage than the learned policy (64.7% vs. 61.9%) but fails entirely on hard queries where the correct tool is ranked 6th to 20th, since a depth of 5 cannot reach them. The BoR-driven agent recovers 16.7% of those queries by searching deeper. Downstream validation with Claude Sonnet 4.6 confirms the practical benefit: compared with always showing 5 tools, adaptive shortlists raise the LLM's tool selection accuracy from 87.1% to 93.1%. The gap widens to 76.8% vs. 60.9% on medium-difficulty queries, where the correct tool is present but not ranked first, suggesting that adaptive depth not only saves cost but also improves accuracy.
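The headline figures follow from simple arithmetic on the numbers quoted above. The short Python check below reproduces them; the helper names `relative_reduction` and `point_gap` are hypothetical, and reading 76.8% as the adaptive result and 60.9% as the fixed result is an assumption based on the order in which the article reports them.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fractional reduction when going from `before` to `after`."""
    return (before - after) / before


def point_gap(a: float, b: float) -> float:
    """Difference between two percentages, in percentage points."""
    return a - b


# Tools shown: fixed 50 vs. an average of 7 for the learned policy.
print(f"reduction in tools shown: {relative_reduction(50, 7):.0%}")

# Coverage given up on BFCL: fixed 50-tool shortlist vs. learned policy.
print(f"coverage gap: {point_gap(90.8, 90.3):.1f} points")

# Downstream tool selection accuracy: adaptive vs. fixed 5-tool shortlist.
print(f"overall gain: {point_gap(93.1, 87.1):.1f} points")

# Medium-difficulty queries (assumed order: adaptive, then fixed).
print(f"medium-difficulty gain: {point_gap(76.8, 60.9):.1f} points")
```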

Key Points

Bits-over-Random metric evaluates tool shortlist depth by comparing success to random chance at that depth.
RL agent on BFCL shows only 7 tools on average vs. 50 fixed, achieving 90.3% coverage (vs. 90.8%).
Downstream with Claude Sonnet 4.6: adaptive lists boost tool selection accuracy from 87.1% to 93.1% vs. fixed 5-tool shortlist.

Why It Matters

Choosing shortlist depth per query lets LLM agents see far fewer tools while picking the right one more often, improving both efficiency and accuracy.
