Sakura: Multi-agent AI generates complex tests from natural language
Automated test generation that beats Gemini CLI by 50–78% on compilability.
Testing remains a bottleneck in software development. While existing automated test generators can handle isolated unit tests or simple API endpoints, they struggle to replicate the complex, multi-method tests that developers write manually. Recent evidence shows a large gap between generated and developer-written tests, particularly in call sequences and elaborate assertions. To address this, a team of researchers (Stennett, Pan, McGinn, Orso, and Sinha) introduces Sakura, an agent-based framework that translates natural language (NL) test descriptions into compilable, structurally complex test code. Sakura decomposes NL descriptions into structured blocks and processes them via a multi-agent system: a localization agent uses static analysis to ground test steps in concrete code, a composition agent synthesizes and refines test code using execution feedback, and a supervisor agent coordinates interactions.
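The division of labor described above can be pictured with a toy Python pipeline. This is only a sketch under stated assumptions, not Sakura's actual implementation: the names `Step`, `decompose`, `localize`, `compose`, `supervise`, and `Result` are hypothetical, the static analysis is replaced by a keyword lookup, the LLM synthesis by string formatting, and the compile-and-run step by an injected `run` callback that returns an error message or `None`.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    """One structured block from the NL description (hypothetical shape)."""
    text: str
    symbol: Optional[str] = None  # concrete code element, set by localization


@dataclass
class Result:
    code: str
    rounds: int
    ok: bool


def decompose(description: str) -> list[Step]:
    """Toy decomposition: treat each non-empty line as one test step."""
    return [Step(line.strip()) for line in description.splitlines() if line.strip()]


def localize(step: Step, symbol_index: dict[str, str]) -> Step:
    """Stand-in for static analysis: map a keyword in the step to a signature."""
    for keyword, signature in symbol_index.items():
        if keyword in step.text.lower():
            return Step(step.text, signature)
    return step  # left ungrounded if nothing matches


def compose(steps: list[Step], feedback: Optional[str]) -> str:
    """Stand-in for LLM synthesis: one line per step, noting any feedback."""
    lines = [f"// {s.text} -> {s.symbol or 'UNRESOLVED'}" for s in steps]
    if feedback:
        lines.insert(0, f"// revised after: {feedback}")
    return "\n".join(lines)


def supervise(
    description: str,
    symbol_index: dict[str, str],
    run: Callable[[str], Optional[str]],
    max_rounds: int = 3,
) -> Result:
    """Coordinate the two agents: ground the steps once, then compose and
    refine until `run` reports no error or the round budget is spent."""
    steps = [localize(s, symbol_index) for s in decompose(description)]
    feedback: Optional[str] = None
    code = ""
    for round_no in range(1, max_rounds + 1):
        code = compose(steps, feedback)
        feedback = run(code)  # None means the test compiled and ran
        if feedback is None:
            return Result(code, round_no, True)
    return Result(code, max_rounds, False)
```

In this sketch the supervisor is just a bounded retry loop. The point it illustrates is that execution feedback flows back into composition, while localization supplies the concrete symbols the composed test refers to.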
For evaluation, the authors curated a dataset of NL test descriptions at three abstraction levels from Apache Commons projects. Across 20 applications and 1,464 test scenarios, Sakura significantly outperformed off-the-shelf agentic tools such as Gemini CLI, achieving 50–78% higher test compilability and 38–66% higher coverage overlap with ground-truth tests. Paired with small open-source models (Devstral Small 2, Qwen3-Coder), Sakura also outperformed Gemini CLI running large proprietary models while being more cost-effective. The paper, accepted to ISSTA 2026, is a step toward bridging the gap between developer intent and automated test generation.
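To make the two reported metrics concrete, here is a minimal Python sketch. The definitions are assumptions for illustration, since the summary does not give the paper's exact formulas: compilability is taken as the fraction of generated tests that compile, and coverage overlap as the share of code elements covered by the ground-truth test that the generated test also covers. The function names and the `file:line` identifiers are hypothetical.

```python
def compilability_rate(compiled: list[bool]) -> float:
    """Fraction of generated tests that compiled (assumed definition)."""
    return sum(compiled) / len(compiled) if compiled else 0.0


def coverage_overlap(generated: set[str], ground_truth: set[str]) -> float:
    """Share of ground-truth-covered elements that the generated test also
    covers (assumed definition; the paper's metric may differ)."""
    if not ground_truth:
        return 0.0
    return len(generated & ground_truth) / len(ground_truth)


# Hypothetical example: 3 of 4 generated tests compile, and one generated
# test covers 2 of the 4 lines that its ground-truth counterpart covers.
rate = compilability_rate([True, True, False, True])
overlap = coverage_overlap(
    {"A.java:10", "A.java:11", "B.java:3"},
    {"A.java:10", "A.java:11", "A.java:12", "A.java:13"},
)
```

Under these assumed definitions, lines covered only by the generated test (`B.java:3` here) do not raise the overlap score, so the metric rewards exercising the same behavior as the developer-written test rather than raw coverage.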
- Uses three specialized agents: localization (static code analysis), composition (code synthesis with execution feedback), and supervisor (coordination).
- Achieves 50–78% higher test compilability and 38–66% higher coverage overlap with ground-truth tests than off-the-shelf agentic tools like Gemini CLI.
- With small open-source models (Devstral Small 2, Qwen3-Coder), outperforms Gemini CLI running large proprietary models, at lower cost.
Why It Matters
Developers can describe test intent in natural language and get complex, compilable test code generated automatically, reducing the manual effort of writing multi-method tests by hand.