AIDABench: AI Data Analytics Benchmark
New benchmark with 600+ complex tasks shows even human experts need 1-2 hours per question.
A consortium of 27 researchers has launched AIDABench, a new benchmark designed to rigorously test AI systems on complex, end-to-end data analytics tasks. Unlike previous benchmarks that focus on isolated capabilities, AIDABench simulates real-world scenarios with 600+ diverse tasks across three core dimensions: question answering, data visualization, and file generation. The tasks are grounded in heterogeneous data sources such as spreadsheets, databases, and financial reports, reflecting analytical demands across a range of industries. The benchmark's difficulty is underscored by the fact that even human experts, assisted by AI tools, need 1-2 hours to complete a single question.
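To make the task structure concrete, here is a minimal sketch of what one AIDABench-style task record might look like. The schema and every field name are illustrative assumptions; the summary does not describe the benchmark's actual data format.

```python
from dataclasses import dataclass
from typing import List, Literal

# Hypothetical task record; field names are illustrative, not AIDABench's.
@dataclass
class AnalyticsTask:
    task_id: str
    category: Literal["question_answering", "data_visualization", "file_generation"]
    source_files: List[str]   # e.g. spreadsheets, database dumps, financial PDFs
    prompt: str               # the analytical question or instruction
    expected_artifact: str    # reference answer, chart spec, or output file path

# Example instance of a QA-style task over a spreadsheet:
task = AnalyticsTask(
    task_id="qa-0042",
    category="question_answering",
    source_files=["q3_revenue.xlsx"],
    prompt="Which region's quarter-over-quarter revenue growth exceeded 10%?",
    expected_artifact="EMEA",
)
```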
The team evaluated 11 state-of-the-art models, including proprietary systems such as Claude Sonnet 4.5 and Gemini 3 Pro Preview and open-source models such as Qwen3-Max-2026, and the results were sobering. The top-performing model achieved a pass@1 score of only 59.43%, revealing that current AI systems still struggle significantly with the integrated reasoning and execution that practical data analytics demands. The team provides a detailed analysis of failure modes and identifies key challenges for future research, positioning AIDABench as a critical tool for enterprise procurement and tool selection, and for guiding model development toward solving genuine business problems.
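For readers unfamiliar with the metric: pass@1 is the probability that a single sampled attempt solves a task, averaged over all tasks. The summary does not spell out AIDABench's exact scoring protocol, so the sketch below uses the standard unbiased pass@k estimator from Chen et al. (2021); the function name and the per-task tallies are hypothetical.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): the probability
    that at least one of k samples passes, given n attempts of which
    c were correct."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct attempt
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Hypothetical per-task (attempts, correct) tallies; benchmark-level
# pass@1 is the mean of the per-task estimates.
results = [(5, 3), (5, 0), (5, 5)]
score = sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
print(f"pass@1 = {score:.2%}")  # pass@1 = 53.33%
```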
- Benchmark includes 600+ complex tasks across QA, visualization, and file generation, all grounded in real-world documents.
- Even human experts with AI assistance need 1-2 hours per question, highlighting the benchmark's difficulty.
- The top-performing model (unspecified) scored only 59.43% pass@1, showing a major gap in AI's practical analytics capabilities.
Why It Matters
Provides a rigorous standard for enterprises to evaluate AI tools on real business analytics, not just academic tasks.