Beyond Single Plots: A Benchmark for Question Answering over Multi-Chart Images
AI models struggle to answer questions across multiple charts, new benchmark shows.
Researchers at Iowa State University and Samsung have released PolyChartQA, a mid-scale benchmark designed to test multimodal language models (MLMs) on question answering over multi-chart images. The dataset includes 534 multi-chart images comprising 2,297 sub-charts sourced from peer-reviewed computer science publications, along with 2,694 question-answer pairs. The benchmark addresses a critical gap in AI research: most existing benchmarks focus on single charts, while real-world applications often require integrating information from multiple related visualizations.
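For readers unfamiliar with the format, an item in a multi-chart QA benchmark typically pairs one composite image with a question and a gold answer. The sketch below shows one plausible record layout and a loader; the field names are illustrative assumptions, not PolyChartQA's published schema.

```python
import json

# Hypothetical record for a multi-chart QA item. Field names are
# assumptions for illustration, not PolyChartQA's actual schema.
example_item = {
    "image_path": "images/chart_0042.png",  # one image containing several sub-charts
    "num_subcharts": 4,                     # sub-charts composed into the image
    "question": "Which method improves most between Figure 3(a) and 3(b)?",
    "answer": "Method B",
    "question_source": "human",             # human-authored vs. model-generated
}

def load_benchmark(path):
    """Load a JSONL file of QA items, one JSON record per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]
```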
Evaluating nine state-of-the-art MLMs, the team found a 27.4% drop in LLM-based accuracy (L-Accuracy) on human-authored questions compared to MLM-generated questions, highlighting how hard it is for current models to generalize to real-world, human-crafted queries. They also introduced a prompting method that yielded a 5.39% gain in L-Accuracy. Together, these results underscore the need for more robust multi-chart reasoning in AI systems, with implications for data analysis, scientific research, and business intelligence.
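The summary doesn't spell out the paper's exact evaluation protocol, but LLM-based accuracy generally means an LLM judges whether a model's free-form answer matches the gold answer, tolerating paraphrase and formatting differences. A minimal sketch under that assumption, where `judge` is a hypothetical callable wrapping any chat-completion API:

```python
def l_accuracy(items, predictions, judge):
    """LLM-based accuracy: an LLM judge decides whether each predicted
    answer conveys the same fact as the gold answer.

    `judge(prompt)` is a hypothetical callable wrapping any
    chat-completion API; it returns the judge model's text reply.
    """
    correct = 0
    for item, pred in zip(items, predictions):
        prompt = (
            f"Question: {item['question']}\n"
            f"Gold answer: {item['answer']}\n"
            f"Model answer: {pred}\n"
            "Does the model answer convey the same fact as the gold answer? "
            "Reply with exactly 'yes' or 'no'."
        )
        if judge(prompt).strip().lower().startswith("yes"):
            correct += 1
    return correct / len(items)
```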
- PolyChartQA includes 534 multi-chart images with 2,297 sub-charts from computer science papers
- Nine state-of-the-art MLMs showed a 27.4% accuracy drop on human-authored vs. MLM-generated questions
- A new prompting method improved L-Accuracy by 5.39% on the benchmark (a generic sketch of multi-chart prompting follows this list)
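The summary doesn't describe how the paper's prompting method works, so the sketch below shows a generic decompose-then-answer strategy sometimes used for multi-chart inputs: first elicit a description of each sub-chart, then ask the question against those descriptions. This is an illustrative assumption, not the authors' technique; `ask_model` is a hypothetical callable wrapping any multimodal chat API.

```python
def decompose_then_answer(image, question, num_subcharts, ask_model):
    """Generic two-stage prompt for multi-chart QA (illustrative only,
    not the PolyChartQA authors' method).

    `ask_model(image, prompt)` is a hypothetical callable wrapping any
    multimodal chat-completion API; it returns the model's text reply.
    """
    # Stage 1: make the model commit to the contents of every sub-chart.
    descriptions = []
    for i in range(num_subcharts):
        desc = ask_model(
            image,
            f"Describe sub-chart {i + 1} of {num_subcharts}: its axes, "
            "series, and the main trend. Be brief and factual.",
        )
        descriptions.append(f"Sub-chart {i + 1}: {desc}")

    # Stage 2: answer using the image plus the grounded descriptions.
    context = "\n".join(descriptions)
    return ask_model(
        image,
        f"{context}\n\nUsing the image and the notes above, answer:\n{question}",
    )
```

The point of the two-stage design is that per-sub-chart descriptions force the model to ground each panel before attempting the cross-chart comparison, which is exactly where the benchmark suggests current models fail.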
Why It Matters
This benchmark reveals critical weaknesses in AI's ability to reason across multiple charts, impacting real-world data analysis.