VAMPS benchmark: LLMs worse at math with visual tools than without
New benchmark reveals multimodal models fail to exploit their own plotted graphs for math reasoning.
A team of researchers from the University of British Columbia, Simon Fraser University, and others has released VAMPS (Visual-Assisted Mathematical Problem Solving), a benchmark that tests whether multimodal LLMs can use self-generated visualizations, specifically plotted graphs, to solve algebra and calculus problems. The benchmark contains 1,168 multiple-choice question-answer pairs in English and Persian, drawn from Iranian University Entrance Exam problems and augmented with LLM-generated synthetic variants. Every problem was selected so that plotting a function reveals key features such as intersections, extrema, and asymptotes, making visual reasoning a natural solution strategy.
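To make "plotting reveals key features" concrete, here is a minimal sketch that samples two curves on a grid and reads off approximate intersections and local extrema, roughly what a person would pick out from a plot. It is not taken from the VAMPS paper: the functions `f` and `g`, the helper names, and the grid size are all illustrative assumptions, and asymptotes are not handled.

```python
# Illustrative only: f, g, and every helper name here are assumptions,
# not part of the VAMPS benchmark or its tooling.

def sample(func, lo, hi, n=2001):
    """Evaluate func at n evenly spaced points in [lo, hi], as a plot would."""
    step = (hi - lo) / (n - 1)
    xs = [lo + i * step for i in range(n)]
    return xs, [func(x) for x in xs]

def zero_crossings(xs, ys):
    """Approximate x-values where the sampled curve crosses zero."""
    roots = []
    for i in range(len(xs) - 1):
        y0, y1 = ys[i], ys[i + 1]
        if y0 == 0.0:
            roots.append(xs[i])
        elif y0 * y1 < 0.0:
            # Sign change: place the root by linear interpolation.
            t = y0 / (y0 - y1)
            roots.append(xs[i] + t * (xs[i + 1] - xs[i]))
    if ys[-1] == 0.0:
        roots.append(xs[-1])
    return roots

def local_extrema(xs, ys):
    """Interior grid points that are strictly above or below both neighbours."""
    found = []
    for i in range(1, len(xs) - 1):
        if ys[i] > ys[i - 1] and ys[i] > ys[i + 1]:
            found.append(("max", xs[i], ys[i]))
        elif ys[i] < ys[i - 1] and ys[i] < ys[i + 1]:
            found.append(("min", xs[i], ys[i]))
    return found

def f(x):
    return x**3 - 3 * x  # hypothetical exam-style cubic

def g(x):
    return x  # hypothetical line to intersect with f

xs, fy = sample(f, -3.0, 3.0)
_, gy = sample(g, -3.0, 3.0)

# Intersections of f and g are the zero crossings of f - g.
meets = zero_crossings(xs, [a - b for a, b in zip(fy, gy)])
extrema = local_extrema(xs, fy)

print("intersections near x =", [round(x, 3) for x in meets])
print("extrema:", [(kind, round(x, 3), round(y, 3)) for kind, x, y in extrema])
```

For this cubic the exact intersections are at x = -2, 0, and 2, with a local maximum at x = -1 and a local minimum at x = 1; the sketch recovers them only to within the grid spacing, which is the same resolution limit a reader faces when eyeballing a plot.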
Across a diverse set of models, including GPT-4o, Claude 3.5, and Gemini, the results were counterintuitive: direct analytical solving (text-only, without tool use) consistently outperformed tool-enabled visual solving, in which the model first generates a graph and then reasons over it. This suggests that current multimodal LLMs struggle to reliably ground their reasoning in self-constructed visual aids, even for problems specifically chosen to benefit from graphing. The finding matters for engineering and scientific workflows that rely on visualization tools for analysis and decision-making, because it points to a limitation in how current LLMs integrate tool use with reasoning.
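The two conditions described above can be compared on the same items with a small paired harness. The sketch below is an assumption about how such a comparison might be organized, not the authors' evaluation code: `Item`, `Solver`, `compare`, and the two stub solvers are hypothetical, and the toy items are placeholders rather than VAMPS problems or results.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence

@dataclass(frozen=True)
class Item:
    """One multiple-choice problem (hypothetical schema, not the VAMPS format)."""
    question: str
    choices: Sequence[str]
    answer: int  # index of the correct choice

# A solver maps an item to the index of the choice it picks.
Solver = Callable[[Item], int]

def compare(direct: Solver, visual: Solver, items: Sequence[Item]) -> Dict[str, float]:
    """Score both conditions on the same items and count paired disagreements."""
    if not items:
        raise ValueError("need at least one item")
    direct_ok = [direct(it) == it.answer for it in items]
    visual_ok = [visual(it) == it.answer for it in items]
    return {
        "direct_acc": sum(direct_ok) / len(items),
        "visual_acc": sum(visual_ok) / len(items),
        # Items only one condition got right show where the two diverge.
        "direct_only": sum(d and not v for d, v in zip(direct_ok, visual_ok)),
        "visual_only": sum(v and not d for d, v in zip(direct_ok, visual_ok)),
    }

# Stand-in solvers. In a real run these would call a model, once with
# text only and once with a plotting tool; here they are trivial stubs.
def pick_first(item: Item) -> int:
    return 0

def pick_last(item: Item) -> int:
    return len(item.choices) - 1

# Toy placeholder items, not drawn from the benchmark.
toy_items = [
    Item("Toy: 2 + 2 = ?", ("4", "5", "6", "7"), 0),
    Item("Toy: how many real roots does x^2 - 1 have?", ("0", "1", "3", "2"), 3),
]

result = compare(pick_first, pick_last, toy_items)
print(result)
```

Scoring both conditions on identical items, rather than on separate samples, keeps problem difficulty from confounding the comparison, and the `direct_only` and `visual_only` counts show whether the gap comes from a few items or is spread across the set.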
- VAMPS contains 1,168 bilingual (English/Persian) math problems from Iranian University Entrance Exams, designed for graph-assisted solving.
- Direct analytical solving outperformed tool-enabled visual solving across all tested multimodal LLMs, including GPT-4o, Claude 3.5, and Gemini.
- The benchmark is publicly available on arXiv (2606.04244) and includes both original problems and LLM-generated synthetic variants validated by human reviewers.
Why It Matters
The results expose a critical gap: current multimodal LLMs cannot yet reliably use their own plots to improve math reasoning, which limits their usefulness in engineering workflows that depend on visualization.