TestExplora: Benchmarking LLMs for Proactive Bug Discovery via Repository-Level Test Generation
Even top AI models can't find most software bugs before they cause failures.
Deep Dive
A new benchmark called TestExplora reveals a substantial gap in AI's ability to proactively find software bugs. The benchmark comprises 2,389 tasks drawn from 482 repositories; in each task, a model must uncover a hidden defect by comparing the code against its documentation and writing a test that exposes the mismatch. State-of-the-art models achieved a maximum success rate of only 16.06%, and even pairing GPT-5-mini with the SWE-agent framework raised that to just 17.27%, showing that current AI is largely reactive, not proactive, in software testing.
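To make the task format concrete, here is a minimal sketch of what a TestExplora-style task might look like. The function, its docstring, and the test are illustrative inventions, not taken from the benchmark: the model reads code whose documentation promises one behavior, notices the implementation diverges, and writes a test that fails on the buggy code.

```python
# Hypothetical repository code (not from TestExplora): the docstring
# promises inclusive bounds, but the implementation treats the upper
# bound as exclusive -- a defect visible only by reading the docs.
def clamp(value: int, low: int, high: int) -> int:
    """Clamp value to the inclusive range [low, high]."""
    if value < low:
        return low
    if value >= high:      # bug: should be `value > high`
        return high - 1    # bug: should return `high`
    return value


# A bug-revealing test of the kind the benchmark asks models to write:
# it encodes the documented contract, so it fails against the buggy
# implementation and passes once the code matches its docstring.
def test_clamp_upper_bound_is_inclusive():
    assert clamp(10, 0, 10) == 10
```

Run under pytest, this test fails on the buggy implementation and passes once the code honors its documentation, which is exactly the discrepancy-detection skill the benchmark measures.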
Why It Matters
This exposes a critical weakness in AI-assisted development: if models cannot proactively surface defects, most bugs will still slip past automated checks into production.