Simulating Validity: Modal Decoupling in MLLM-Generated Feedback on Science Drawings
Study finds 41.3% of AI-generated feedback on science drawings contains grounding errors.
A new study accepted at AIED 2026 tested GPT-5.1 on 150 middle-school science drawings of kinetic molecular theory, spanning five tasks and three competence levels. Researchers generated 300 feedback instances and coded them for four grounding error types: object mismatch, attribute mismatch, relation mismatch, and false absence. They found that 41.3% of all feedback instances contained at least one error, with false absence (treating a depicted element as missing) as the dominant failure mode. Even an inventory-list-first prompting workflow, which asks the model to list the objects it sees before generating feedback, reduced overall errors but still left roughly one in three outputs flawed.
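The inventory-list-first workflow is straightforward to prototype. Below is a minimal sketch of the two-stage idea, assuming a hypothetical `query_mllm` helper that wraps whichever multimodal chat API is in use; the prompt wording is illustrative, not the study's actual instrument.

```python
def query_mllm(prompt: str, image_path: str) -> str:
    """Hypothetical helper: send one prompt plus one image, return model text."""
    raise NotImplementedError("wire this up to your MLLM provider")


def feedback_with_inventory(image_path: str, task: str) -> str:
    # Stage 1: force an explicit object inventory before any evaluation,
    # so later feedback can be audited against what the model itself
    # claims to see in the drawing.
    inventory = query_mllm(
        "List every object, label, and arrow visible in this student "
        "drawing. Do not evaluate the drawing yet.",
        image_path,
    )
    # Stage 2: generate feedback constrained to that inventory. An element
    # may be called "missing" only if it is absent from the inventory.
    return query_mllm(
        f"Task: {task}\n"
        f"Elements you identified in the drawing: {inventory}\n"
        "Give formative feedback on the drawing. Refer only to elements in "
        "the list above, and say an element is missing only if it does not "
        "appear in the list.",
        image_path,
    )
```

The design choice is to make the model commit to a perceptual claim before it reasons pedagogically, so its feedback can be audited against that claim; per the study's results, this helps but is not sufficient on its own.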
The study identifies modal decoupling as a key limitation: MLLM outputs remain pedagogically plausible in form while contradicting the actual visual evidence. Because invalid feedback reads just as well grounded as valid feedback, surface plausibility offers little diagnostic value, and educators cannot easily spot when the AI is wrong. The findings indicate that common prompting strategies are insufficient for reliable grounding in complex visual tasks like student science drawings. For AIED applications, this raises serious concerns about deploying off-the-shelf MLLMs for automated assessment without robust multimodal verification mechanisms.
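What might such verification look like? As one lightweight example (our assumption, not a mechanism from the study), the feedback text can be cross-checked against the stage-one inventory to flag candidate false-absence errors before the feedback ever reaches a student:

```python
import re


def flag_false_absence(feedback: str, inventory: list[str]) -> list[str]:
    """Flag inventory elements that the feedback nonetheless calls missing.

    Naive keyword matching: an element is flagged when a missing/absent/no
    claim appears in the same sentence as the element's name.
    """
    flagged = []
    for element in inventory:
        pattern = rf"\b(missing|absent|no|lacks?)\b[^.]*\b{re.escape(element)}\b"
        if re.search(pattern, feedback, flags=re.IGNORECASE):
            flagged.append(element)
    return flagged


# The feedback wrongly claims two depicted elements are absent.
print(flag_false_absence(
    "The drawing is missing gas particles, and there are no motion arrows.",
    ["gas particles", "container", "motion arrows"],
))  # -> ['gas particles', 'motion arrows']
```

A deployed system would need far more than keyword matching (entailment checks, a second vision pass, or human review), but even a crude filter of this kind targets the study's dominant error type.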
- 41.3% of GPT-5.1 feedback instances on student drawings contained at least one grounding error (object, attribute, relation, or false absence).
- False absence (the model claims an element is missing when it is actually depicted) was the most common failure mode.
- An inventory-list-first prompting workflow reduced error rates but still left roughly 1 in 3 feedback instances flawed.
Why It Matters
Highlights critical limitations in using MLLMs like GPT-5.1 for automated student assessment without robust grounding mechanisms.