AgentSearchBench: A Benchmark for AI Agent Search in the Wild
10,000 real-world agents reveal a gap between descriptions and actual performance
AgentSearchBench, a new benchmark from researchers Bin Wu, Arastun Mammadli, Xiaoyu Zhang, and Emine Yilmaz, tackles the challenge of finding the right AI agent for a task. Unlike traditional tools, agents have capabilities that are often compositional and execution-dependent, making them hard to assess from text alone. The benchmark uses nearly 10,000 real-world agents from multiple providers, formalizing agent search as retrieval and reranking problems. It evaluates relevance using execution-grounded performance signals, not just semantic similarity.
Experiments reveal a consistent gap between semantic similarity and actual agent performance, exposing the limitations of description-based retrieval and reranking methods. However, lightweight behavioral signals, including execution-aware probing, substantially improve ranking quality. This highlights the importance of incorporating execution signals into agent discovery. The code is available on GitHub, and the paper is on arXiv.
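To make the retrieval-and-reranking framing concrete, here is a minimal Python sketch under stated assumptions. Nothing in it comes from the benchmark itself: the agent names, the token-overlap similarity standing in for an embedding model, and the hand-set `execution_relevance` labels standing in for measured task success are all illustrative. What it shows is the evaluation idea: rank agents by description similarity, then score that ranking (here with NDCG) against how the agents actually perform when executed.

```python
import math

# Illustrative data, not from AgentSearchBench: toy agent descriptions
# plus hand-set execution-grounded relevance labels (e.g., observed
# task success rates when each agent is actually run).
agents = {
    "report-writer": "Compiles research reports from web sources",
    "web-researcher": "Searches the web and summarizes findings",
    "code-assistant": "Writes and debugs Python code",
}
execution_relevance = {
    "report-writer": 0.2,   # matches the query well but often fails
    "web-researcher": 0.9,  # weaker description, strong actual performance
    "code-assistant": 0.1,
}

def semantic_score(query: str, description: str) -> float:
    """Stand-in for embedding similarity: token-set Jaccard overlap."""
    q, d = set(query.lower().split()), set(description.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0

def ndcg(ranking, relevance):
    """NDCG of a ranking, scored against execution-grounded relevance."""
    dcg = sum(relevance[a] / math.log2(i + 2) for i, a in enumerate(ranking))
    ideal = sorted(relevance.values(), reverse=True)
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal))
    return dcg / idcg if idcg else 0.0

query = "compile a research report on recent papers"

# Retrieval stage: rank every agent by description similarity alone.
by_description = sorted(
    agents, key=lambda a: semantic_score(query, agents[a]), reverse=True
)

# Evaluation: the ranking is judged by how the agents actually perform
# when executed, not by how well their descriptions match the query.
print("ranking:", by_description)
print(f"NDCG vs execution labels: {ndcg(by_description, execution_relevance):.3f}")
```

In this toy example the agent whose description best matches the query is not the one that executes best, so the NDCG comes out well below 1.0, mirroring the description-performance gap the experiments report.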
- Benchmark built from nearly 10,000 real-world agents across multiple providers
- Formalizes agent search as retrieval and reranking under executable queries and high-level descriptions
- Lightweight behavioral signals (execution-aware probing) substantially improve ranking quality, as sketched below
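The paper's exact probing procedure isn't reproduced here, but the general idea can be sketched. In the hypothetical `rerank_with_probe` below, `run_agent` stands in for actually invoking a shortlisted agent on a cheap probe task and checking the result (in practice it would call the provider's API); its outcome is blended with the semantic score, so an agent that succeeds when actually run can overtake one that merely sounds like a better match.

```python
# `run_agent` is a hypothetical stand-in for executing a small probe
# task against a candidate agent; outcomes are simulated here.
def run_agent(agent_id: str, probe_task: str) -> bool:
    simulated_outcomes = {"web-researcher": True, "report-writer": False}
    return simulated_outcomes.get(agent_id, False)

def rerank_with_probe(candidates, semantic_scores, probe_task, alpha=0.5):
    """Blend description similarity with a behavioral probe signal."""
    def combined(agent_id: str) -> float:
        probe_signal = 1.0 if run_agent(agent_id, probe_task) else 0.0
        return alpha * semantic_scores[agent_id] + (1 - alpha) * probe_signal
    return sorted(candidates, key=combined, reverse=True)

shortlist = ["report-writer", "web-researcher"]          # from retrieval
scores = {"report-writer": 0.6, "web-researcher": 0.5}   # illustrative
probe = "Find and summarize one recent arXiv paper."

print(rerank_with_probe(shortlist, scores, probe))
# -> ['web-researcher', 'report-writer']: the agent that actually
#    succeeds on the probe overtakes the better-described one.
```

With equal weighting (`alpha=0.5`), one successful probe outweighs a modest description-similarity edge, which is the kind of lightweight execution signal the benchmark finds so effective.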
Why It Matters
Better agent search means professionals can delegate tasks to agents that actually perform well, not just agents whose descriptions sound right, making AI task delegation faster and more reliable.