Q-ARE: An Evaluation Dataset for Query Based API Recommendation
New dataset from GitHub Java projects reveals AI struggles with multi-level method chains
As software systems grow more complex, developers must pick the right API from hundreds of options. A new paper from Shenglong Wu and colleagues introduces Q-ARE, a dataset for evaluating how well query-based API recommendation methods, including general large language models (LLMs), understand multi-level method invocations. The authors built the dataset from real open-source Java projects on GitHub: they analyzed methods and their invocation chains to identify the third-party APIs that each method calls directly or indirectly, then recursively expanded those chains so that the hierarchical call structures are unified into recommendation target sets.
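The recursive expansion described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the toy call graph, the `THIRD_PARTY` set, and the function name `expand_targets` are all hypothetical, and a real implementation would first have to extract the call graph from Java source code.

```python
from typing import Dict, List, Optional, Set

# Hypothetical toy call graph: each project method maps to what it invokes.
CALL_GRAPH: Dict[str, List[str]] = {
    "ReportService.export": [
        "ReportService.render",
        "org.apache.commons.io.FileUtils.writeStringToFile",
    ],
    "ReportService.render": ["ReportService.toJson"],
    "ReportService.toJson": ["com.google.gson.Gson.toJson"],
}

# Assumption: third-party APIs are known in advance (e.g. from dependencies).
THIRD_PARTY: Set[str] = {
    "org.apache.commons.io.FileUtils.writeStringToFile",
    "com.google.gson.Gson.toJson",
}


def expand_targets(
    method: str,
    call_graph: Dict[str, List[str]],
    third_party: Set[str],
    visited: Optional[Set[str]] = None,
) -> Set[str]:
    """Collect every third-party API reachable from `method`,
    whether it is called directly or through other project methods."""
    visited = set() if visited is None else visited
    if method in visited:  # guard against recursive or repeated calls
        return set()
    visited.add(method)

    targets: Set[str] = set()
    for callee in call_graph.get(method, []):
        if callee in third_party:
            targets.add(callee)  # direct third-party call
        else:
            # Indirect call: descend into the project method.
            targets |= expand_targets(callee, call_graph, third_party, visited)
    return targets


if __name__ == "__main__":
    print(sorted(expand_targets("ReportService.export", CALL_GRAPH, THIRD_PARTY)))
```

In this toy example, `ReportService.export` ends up with both APIs in its target set, even though it reaches `Gson.toJson` only through two intermediate project methods.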
Q-ARE introduces two novel metrics: API Call Depth, which measures how many levels of the call hierarchy separate a query method from the target API, and Invocation Density, which quantifies the proportion of code lines in the invocation chain that involve the target API. When the authors tested several query-based API recommendation methods and general LLMs, they found a sharp drop in performance as API Call Depth increased and Invocation Density decreased. This indicates that existing algorithms still struggle with deep multi-level method structures, a serious limitation for real-world codebases. Accepted at EQUISA 2026, Q-ARE provides a new benchmark for assessing semantic understanding and a basis for improving AI-powered API recommendation systems.
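To make the two metrics concrete, here is a minimal sketch of how they might be computed on a toy example. The paper's exact definitions may differ. This sketch assumes that a direct call has depth 1, that depth is the shortest distance in the call graph, and that a line "involves" the target API when it contains a given call marker. All names and data are hypothetical.

```python
from collections import deque
from typing import Dict, List, Optional

# Hypothetical call graph: project method -> invoked methods or APIs.
CALLS: Dict[str, List[str]] = {
    "export": ["render", "FileUtils.writeStringToFile"],
    "render": ["toJson"],
    "toJson": ["Gson.toJson"],
}

# Hypothetical source lines of each method on the chain export -> Gson.toJson.
CHAIN_LINES: Dict[str, List[str]] = {
    "export": [
        "String body = render(report);",
        "FileUtils.writeStringToFile(out, body, UTF_8);",
    ],
    "render": ["return toJson(report);"],
    "toJson": ["Gson gson = new Gson();", "return gson.toJson(report);"],
}


def api_call_depth(
    query: str, target: str, calls: Dict[str, List[str]]
) -> Optional[int]:
    """Shortest number of call levels from `query` to `target` (BFS).
    Returns None if the target is unreachable."""
    queue = deque([(query, 0)])
    seen = {query}
    while queue:
        method, depth = queue.popleft()
        for callee in calls.get(method, []):
            if callee == target:
                return depth + 1
            if callee not in seen:
                seen.add(callee)
                queue.append((callee, depth + 1))
    return None


def invocation_density(chain_lines: Dict[str, List[str]], marker: str) -> float:
    """Fraction of lines across the chain that contain `marker`,
    a simple textual stand-in for 'involves the target API'."""
    all_lines = [line for lines in chain_lines.values() for line in lines]
    if not all_lines:
        return 0.0
    hits = sum(1 for line in all_lines if marker in line)
    return hits / len(all_lines)


if __name__ == "__main__":
    print(api_call_depth("export", "Gson.toJson", CALLS))
    print(invocation_density(CHAIN_LINES, "gson.toJson("))
```

Under these assumptions, `Gson.toJson` sits three levels below `export` and appears on one of the five lines in the chain. That combination of high depth and low density is the regime in which the article reports the sharpest performance drop.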
- Q-ARE is built from open-source Java projects on GitHub by analyzing multi-level method invocation chains
- Introduces two metrics: API Call Depth (invocation distance between query method and target API) and Invocation Density (proportion of invocation-chain code lines that involve the target API)
- Evaluation shows a sharp performance drop in LLMs and existing methods as call depth increases and density decreases
Why It Matters
Exposes a critical gap in AI code assistants: their recommendation performance drops sharply on deep API call hierarchies, which limits their real-world usefulness.