LongDS-Bench: top AI agents reach only 48% accuracy on long-horizon data analysis
Even the best AI agent loses nearly 47 percentage points of accuracy from early to late turns, on tasks with an average dependency span of 11.3 turns.
A new benchmark, LongDS-Bench, exposes a critical weakness in current AI agents: they break down when asked to perform long, iterative data analysis. Created by researchers from Zhejiang University and collaborators, the benchmark comprises 68 real-world Kaggle tasks totaling 2,225 turns across six domains, including Geoscience, Business, and Education. Unlike existing tests that evaluate isolated steps, LongDS-Bench forces agents to maintain, update, restore, and compose evolving analytical states over an average dependency span of 11.3 turns.
Of the five state-of-the-art models evaluated, the best performer, likely a GPT-4 class model, reached only 48.45% average accuracy. Performance dropped nearly 47 percentage points from early to late turns, and long-horizon errors accounted for 52% to 69% of all failures. Strikingly, giving agents more steps did not improve results: the key bottleneck is maintaining analytical state correctly, not the size of the interaction budget. This suggests current AI architectures lack reliable mechanisms for long-term memory and state tracking in complex analytical workflows.
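To make the "percentage points" figure concrete, here is a minimal sketch of how an early-versus-late accuracy gap could be computed from per-turn results. The function names, the split point, and the toy data are all hypothetical and chosen for illustration; they are not taken from LongDS-Bench or its evaluation code.

```python
def accuracy(results):
    """Percentage of turns marked correct. Assumes a non-empty list of booleans."""
    return 100.0 * sum(results) / len(results)


def early_late_gap(turn_results, split):
    """Compare accuracy before and after a split turn.

    turn_results: list of (turn_index, correct) pairs (hypothetical format).
    split: first turn index counted as "late" (an illustrative choice).
    """
    early = [ok for turn, ok in turn_results if turn < split]
    late = [ok for turn, ok in turn_results if turn >= split]
    early_acc = accuracy(early)
    late_acc = accuracy(late)
    # Absolute gap, measured in percentage points.
    point_drop = early_acc - late_acc
    # Relative gap, measured as a percent of the early accuracy.
    relative_drop = 100.0 * point_drop / early_acc if early_acc else float("nan")
    return early_acc, late_acc, point_drop, relative_drop


# Toy data, NOT benchmark results: 4 of 5 early turns correct, 1 of 5 late turns correct.
toy = [(1, True), (2, True), (3, True), (4, False), (5, True),
       (6, False), (7, False), (8, True), (9, False), (10, False)]

early_acc, late_acc, point_drop, relative_drop = early_late_gap(toy, split=6)
print(f"early={early_acc:.1f}% late={late_acc:.1f}% "
      f"drop={point_drop:.1f} points ({relative_drop:.1f}% relative)")
```

In this toy case, accuracy falls from 80% to 20%, a drop of 60 percentage points but a 75% relative decline. The two measures can differ widely, which is why the reported gap is stated in percentage points.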
- LongDS-Bench includes 68 real-world Kaggle tasks with 2,225 turns across Geoscience, Business, Education, and 3 other domains.
- Best model accuracy: 48.45%; performance drops nearly 47 percentage points from early to late turns (average dependency span: 11.3 turns).
- 52-69% of failures are due to long-horizon errors; additional agent steps do not improve performance.
Why It Matters
The results indicate that current AI agents cannot reliably handle long, multi-step data analysis, which limits their use in real-world business intelligence.