LLM-Based Automated Diagnosis Of Integration Test Failures At Google
Google's new LLM tool analyzed over 52,000 test failures, achieving 90.14% root cause accuracy.
Google researchers have developed and deployed an AI tool called Auto-Diagnose that uses Large Language Models (LLMs) to automatically diagnose the root cause of integration test failures. Integration tests, which check how different software components work together, generate massive, unstructured logs that are notoriously difficult for developers to parse. Auto-Diagnose tackles this by analyzing the failure logs, identifying the most relevant lines, and producing a concise summary of the likely cause. It is integrated into Critique, Google's internal code review system, so developers receive timely, contextual assistance within their existing workflow.
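The flow described above (take a failure log, narrow it to the relevant lines, ask a model for a short diagnosis) can be illustrated with a minimal sketch. This is not Google's implementation: the summary does not describe Auto-Diagnose's internals, so the keyword filter below is a simple stand-in for relevance selection, and the names `ERROR_PATTERN`, `select_relevant_lines`, `build_prompt`, and the test name `checkout_flow_test` are all hypothetical. The sketch stops at building the prompt and does not call any model.

```python
import re

# Hypothetical failure markers; the signals Auto-Diagnose actually uses
# are not described in the text above.
ERROR_PATTERN = re.compile(r"\b(ERROR|FATAL|Exception|Traceback|FAILED)\b")


def select_relevant_lines(log_lines, context=1):
    """Return (line_number, text) pairs for matching lines plus their neighbors."""
    keep = set()
    for i, line in enumerate(log_lines):
        if ERROR_PATTERN.search(line):
            lo = max(0, i - context)
            hi = min(len(log_lines), i + context + 1)
            keep.update(range(lo, hi))
    # Line numbers are 1-based so the model can cite them back to the reader.
    return [(i + 1, log_lines[i]) for i in sorted(keep)]


def build_prompt(test_name, selected):
    """Assemble a diagnosis prompt from the selected log excerpt."""
    excerpt = "\n".join(f"{n}: {text}" for n, text in selected)
    return (
        f"Integration test '{test_name}' failed.\n"
        "Log excerpt (line number: text):\n"
        f"{excerpt}\n"
        "State the most likely root cause in two sentences "
        "and cite the line numbers that support it."
    )


log = [
    "INFO starting server",
    "INFO connecting to backend",
    "ERROR connection refused: backend:8080",
    "INFO shutting down",
    "INFO done",
]
selected = select_relevant_lines(log)
prompt = build_prompt("checkout_flow_test", selected)
print(prompt)
```

Asking for cited line numbers mirrors the idea of pointing developers at the most relevant log lines rather than returning only a free-form summary.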
In a manual evaluation on 71 real-world failures, the tool diagnosed the correct root cause with 90.14% accuracy. Following its Google-wide deployment, the tool was used to analyze 52,635 distinct failing tests. User feedback was overwhelmingly positive, with the tool rated 'Not helpful' in only 5.8% of cases. Among 370 tools that post findings in Critique, Auto-Diagnose ranked #14 in helpfulness. User interviews confirmed the tool's perceived usefulness and the positive reception of integrating AI-powered diagnostic assistance into existing developer workflows.
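For readers who want to sanity-check the reported figures, the short calculation below recovers the counts they imply. It uses only numbers stated above; the variable names are illustrative. The 5.8% 'Not helpful' figure is left out because the summary does not state its denominator.

```python
evaluated = 71
reported_accuracy = 0.9014

# The only whole number of correct diagnoses consistent with 90.14% of 71.
correct = round(reported_accuracy * evaluated)
accuracy_pct = round(correct / evaluated * 100, 2)

# Rank #14 among 370 tools, expressed as a percentile position.
rank, tools = 14, 370
top_pct = round(rank / tools * 100, 1)

print(correct, accuracy_pct, top_pct)
```

So the evaluation corresponds to 64 correct diagnoses out of 71, and a rank of #14 places the tool within roughly the top 4% of tools posting findings in Critique.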
The research concludes that LLMs are highly effective for this task due to their ability to process and summarize complex textual data. The study also highlights that the tool's high accuracy is a critical factor driving developer adoption and positive perception. This represents a significant step in using AI to reduce cognitive load and save time on tedious debugging tasks, allowing engineers to focus on more creative problem-solving.
- Achieved 90.14% accuracy in root cause diagnosis on 71 real-world test failures.
- Deployed at scale, analyzing 52,635 distinct failing tests with a 'Not helpful' rate of only 5.8%.
- Integrated directly into Google's Critique system, ranking #14 in helpfulness among 370 internal tools.
Why It Matters
Saves developers hours of tedious log analysis, accelerating software development and improving code reliability at scale.