Proactive Detection of GUI Defects in Multi-Window Scenarios via Multimodal Reasoning
A novel multimodal AI system proactively triggers and detects layout defects in split-screen and foldable phone modes.
A research team led by Xinyao Zhang has introduced a new AI-powered framework designed to proactively hunt for GUI display defects in complex multi-window mobile environments, such as split-screen and foldable phone modes. The system moves beyond passive screenshot analysis by actively triggering problematic states during app exploration. It employs the Set-of-Mark (SoM) technique to precisely align screenshots with individual interface widgets, then uses multimodal large language models (LLMs) with chain-of-thought prompting to detect, localize, and explain visual bugs like text truncation and widget occlusion.
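The Set-of-Mark step can be pictured as overlaying a numbered label on each widget's bounding box so the multimodal model can refer to widgets by number rather than by fuzzy description. A minimal sketch of that idea, assuming widget bounds are already available (e.g. from the Android view hierarchy) and using illustrative function names, not the paper's actual implementation:

```python
from PIL import Image, ImageDraw

def mark_widgets(screenshot, widgets):
    """Overlay a numbered mark on each widget's bounding box (Set-of-Mark style).

    `widgets` is a list of (left, top, right, bottom) boxes, e.g. parsed from
    the view hierarchy. Returns the annotated image plus an index that maps
    each mark number back to its widget box.
    """
    annotated = screenshot.copy()
    draw = ImageDraw.Draw(annotated)
    index = {}
    for i, box in enumerate(widgets, start=1):
        left, top, _, _ = box
        draw.rectangle(box, outline="red", width=3)         # widget outline
        draw.text((left + 4, top + 4), str(i), fill="red")  # numeric mark
        index[i] = box                                      # mark -> widget
    return annotated, index

# Stand-in screenshot with two hypothetical widget boxes:
screenshot = Image.new("RGB", (400, 300), "white")
annotated, index = mark_widgets(screenshot, [(20, 20, 180, 80), (200, 40, 380, 120)])
```

The marked screenshot would then go to the multimodal LLM, which can report defects by mark number ("widget 2 truncates its text"), and the index resolves that number back to concrete screen coordinates for localization.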
The researchers built a benchmark from 50 real-world Android apps and found that multi-window settings dramatically increase layout-related defects: text truncation issues surge by 184% compared with traditional full-screen testing. At the application level, the method identified 40 defect-prone apps with a 10.00% false positive rate and an 11.11% false negative rate. At the widget level, it achieved its best F1 score, 87.2%, on detecting visual overlap and occlusion between widgets, significantly outperforming baselines such as OwlEye and YOLO-based detectors.
- Proactively triggers split-screen, foldable, and transition states to find bugs other methods miss, moving beyond passive analysis.
- Uses Set-of-Mark (SoM) and multimodal LLMs to achieve an 87.2% F1 score for detecting widget occlusion.
- Found text truncation defects increase by 184% in multi-window settings versus full-screen, identifying 40 defect-prone apps.
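For intuition about the occlusion defect class above: geometrically, two widgets occlude each other when their visible bounding boxes intersect. The paper's detector is LLM-based, but the underlying condition can be sketched as a plain rectangle-intersection check (illustrative code, not the authors' implementation):

```python
def boxes_overlap(a, b):
    """True if two widget bounding boxes (left, top, right, bottom) intersect.

    In a correct layout, sibling widgets' visible bounds should not intersect;
    an intersection after a split-screen or fold resize signals occlusion.
    """
    a_left, a_top, a_right, a_bottom = a
    b_left, b_top, b_right, b_bottom = b
    return (a_left < b_right and b_left < a_right and
            a_top < b_bottom and b_top < a_bottom)

def overlap_area(a, b):
    """Area of the intersection, usable as a rough severity score."""
    if not boxes_overlap(a, b):
        return 0
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return width * height

# Two hypothetical widgets that collide after a split-screen resize:
button = (100, 200, 260, 260)
label = (240, 220, 400, 250)
print(boxes_overlap(button, label))  # → True
print(overlap_area(button, label))   # → 600 (20 px wide x 30 px tall)
```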
Why It Matters
This automates a QA task that is tedious to perform by hand, helping developers catch layout defects on foldables and in multi-tasking modes before users encounter them.