Using Vision + Language Models to Predict Item Difficulty
A new study uses OpenAI's GPT-4.1-nano to analyze data visualizations and accompanying text, predicting question difficulty up to 34% more accurately than text-only analysis.
A new research paper demonstrates how large language models (LLMs) can be used to automate the complex task of predicting test item difficulty. Researcher Samin Khan leveraged OpenAI's GPT-4.1-nano to analyze data visualization literacy test items, which combine graphical charts with textual questions and answers. The study systematically compared three approaches: analyzing only the text, only the visualization image, or a multimodal combination of both. The goal was to predict the empirical difficulty of each item—the proportion of U.S. adults who would answer it correctly—showcasing a novel application of AI in educational measurement and psychometrics.
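The paper's exact prompts are not reproduced here, but a minimal sketch of the multimodal setup might look like the following, assuming the OpenAI Python SDK's chat completions API. The prompt wording, helper name, and response parsing are illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch: send a chart image plus the item's question text to
# GPT-4.1-nano and ask for a difficulty estimate in [0, 1]. The prompt and
# parsing below are illustrative, not taken from the paper.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def predict_difficulty(image_path: str, question: str, options: list[str]) -> float:
    """Estimate the proportion of U.S. adults who would answer this
    visualization-literacy item correctly."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4.1-nano",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": ("Estimate the proportion of U.S. adults who would "
                          "answer this chart-reading question correctly. "
                          "Reply with a single number between 0 and 1.\n\n"
                          f"Question: {question}\n"
                          "Options: " + "; ".join(options))},
                # The chart image is passed inline as a base64 data URL.
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    # Assumes the model complies with the "single number" instruction.
    return float(response.choices[0].message.content.strip())
```

A text-only or vision-only variant of this sketch would simply drop the image part or the question text from the message content, which is the comparison the study ran.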
The results were clear: the multimodal model, which processed both visual and textual features, was significantly more accurate. It achieved a mean absolute error (MAE) of 0.224, a 20% improvement over the vision-only model (0.282 MAE) and a 34% improvement over the text-only model (0.338 MAE). When applied to a held-out test set, the best model achieved a mean squared error of 0.10805. This research, published on arXiv, highlights the practical potential of lightweight models like GPT-4.1-nano for automating labor-intensive tasks in test development, potentially speeding up the creation of balanced assessments and providing rapid feedback on item design before human trials.
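For reference, the two reported metrics are straightforward to compute against the empirical difficulty labels. The sketch below uses placeholder values, not data from the study.

```python
# Mean absolute error and mean squared error between predicted and
# empirical difficulty (proportion of respondents answering correctly).

def mean_absolute_error(predicted: list[float], empirical: list[float]) -> float:
    return sum(abs(p, ) if False else abs(p - e) for p, e in zip(predicted, empirical)) / len(predicted)

def mean_squared_error(predicted: list[float], empirical: list[float]) -> float:
    return sum((p - e) ** 2 for p, e in zip(predicted, empirical)) / len(predicted)

preds = [0.62, 0.41, 0.88]    # model-predicted proportion correct (placeholder)
actual = [0.55, 0.47, 0.90]   # observed proportion correct (placeholder)

print(mean_absolute_error(preds, actual))  # lower is better; the paper reports 0.224 (multimodal)
print(mean_squared_error(preds, actual))
```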
- Multimodal GPT-4.1-nano analysis achieved a 0.224 MAE, beating unimodal methods by 20-34%.
- The model predicts difficulty for data visualization literacy items, combining chart images with text.
- Demonstrates potential for automated psychometric analysis, reducing manual effort in test development.
Why It Matters
Automating difficulty prediction cuts a labor-intensive step of test design, allowing educators and researchers to develop balanced assessments faster.