Automating Detection and Root-Cause Analysis of Flaky Tests in Quantum Software
A new AI pipeline uses LLMs, led by Google's Gemini, to detect flaky tests in quantum software with an F1-score of 0.94.
A team of researchers has developed an automated pipeline to tackle a critical problem in quantum software engineering: flaky tests. These are tests that pass or fail inconsistently without any change to the code, often because quantum outputs are probabilistic, and they can hide real bugs and waste developer time. The system scans quantum software repositories on GitHub, such as those for Qiskit or Cirq, to automatically detect issue reports and pull requests related to test flakiness. It then uses Large Language Models (LLMs) to classify whether a reported issue is genuinely caused by flakiness and to identify its root cause.
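The first stage described above, narrowing issue reports and pull requests down to likely flakiness cases before an LLM classifies them, can be sketched roughly as follows. This is a minimal illustration and not the researchers' implementation: the keyword list, the function names, and the sample reports are all assumptions made for the example, and a real pipeline would fetch the reports from GitHub rather than from an in-memory list.

```python
import re

# Hypothetical keyword list (an assumption for this sketch);
# the study's actual search terms are not given in this article.
FLAKY_PATTERN = re.compile(
    r"\b(flaky|flakey|flakiness|intermittent(ly)?|sporadic(ally)?"
    r"|non-?deterministic|randomly fails?)\b",
    re.IGNORECASE,
)


def is_flakiness_candidate(title: str, body: str) -> bool:
    """Return True if the report text mentions a flakiness-related keyword."""
    return bool(FLAKY_PATTERN.search(f"{title}\n{body}"))


def select_candidates(reports):
    """Keep the reports (dicts with 'title' and 'body') worth sending to an LLM."""
    return [
        r for r in reports
        # A missing or None body is treated as empty text.
        if is_flakiness_candidate(r.get("title", ""), r.get("body") or "")
    ]


# Made-up sample reports, for illustration only.
sample_reports = [
    {"title": "test_sampler fails intermittently on CI", "body": "Passes on rerun."},
    {"title": "Add docs for transpiler passes", "body": None},
    {"title": "Fix seed in simulator test", "body": "The test is flaky without a fixed seed."},
]

candidates = select_candidates(sample_reports)
for report in candidates:
    print(report["title"])
```

A cheap keyword filter like this only reduces the number of reports that reach the model. Deciding whether a candidate is genuinely flaky, and why, is the part the article attributes to the LLM.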
The researchers evaluated models from OpenAI's GPT, Meta's LLaMA, Google's Gemini, and Anthropic's Claude suites. Google's Gemini emerged as the top performer, achieving an F1-score of 0.9420 for flakiness detection and 0.9643 for root-cause identification. By applying the pipeline, the team discovered 25 previously unknown flaky test cases, increasing the size of an existing benchmark dataset by 54%. This work provides the quantum software community with both an expanded dataset of known flaky tests and a reusable tool for automating their triage.
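For readers less familiar with the metric, the F1-score is the harmonic mean of precision and recall, so it is not the same thing as plain accuracy. The sketch below shows how it is computed from confusion counts and what the reported dataset growth implies about the size of the original benchmark. The confusion counts are invented for illustration and are not the study's data, and the baseline size is only an estimate derived from the rounded figures quoted above.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2PR / (P + R), written directly in terms of counts.

    tp, fp, fn are true positives, false positives, and false negatives.
    """
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def implied_baseline(new_items: int, pct_increase: float) -> float:
    """Original dataset size implied by adding new_items for a given % increase."""
    return new_items / (pct_increase / 100)


# Made-up confusion counts, for illustration only (not from the study).
example_f1 = f1_score(tp=45, fp=3, fn=2)
print(f"example F1: {example_f1:.4f}")

# 25 new tests reported as a 54% increase (both figures from the article).
baseline = implied_baseline(25, 54)
print(f"implied original dataset size: about {round(baseline)} tests")
```

Under these figures the original benchmark would have held roughly 46 flaky tests, growing to roughly 71 after the 25 additions.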
Future work will focus on improving the robustness of the detection pipeline and exploring the potential for automated repair of these quantum flaky tests. The study demonstrates that modern LLMs can move beyond text generation to provide concrete, high-accuracy support for specialized software engineering tasks in cutting-edge fields like quantum computing.
- The AI pipeline discovered 25 new quantum flaky tests, expanding the known dataset by 54%.
- Google's Gemini model led performance with a 0.942 F1-score for detection and 0.964 for root-cause analysis.
- The tool automates the triage of bug reports in quantum software repos, saving developer time on probabilistic test failures.
Why It Matters
The pipeline gives quantum developers a reusable tool for improving software reliability by automatically triaging elusive, non-deterministic test failures.