Sketch2Feedback: Grammar-in-the-Loop Framework for Rubric-Aligned Feedback on Student STEM Diagrams
A new grammar-in-the-loop system outperforms end-to-end models such as LLaVA and Qwen2-VL on hallucination rate and feedback actionability for student sketches.
A new research paper introduces Sketch2Feedback, an AI framework that targets hallucination in educational AI systems. Developed by researcher Aayam Bansal, the system addresses the persistent challenge of providing accurate, rubric-aligned feedback on student-drawn STEM diagrams such as free-body diagrams and circuit schematics.
The framework employs a 'grammar-in-the-loop' architecture that decomposes the feedback process into four stages: hybrid perception, symbolic graph construction, constraint checking, and constrained VLM feedback. The language model verbalizes only violations verified by the upstream rule engine, which sharply reduces hallucinations. On two synthetic benchmarks (FBD-10 and Circuit-10, 500 images each), Qwen2-VL-7B achieved the highest micro-F1 scores (0.570 on FBDs, 0.528 on circuits) but suffered extreme hallucination rates of 78% and 98%, respectively. Sketch2Feedback's ensemble traded little accuracy for reliability, reaching F1 = 0.556 on FBDs while cutting hallucination to 32%.
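The constraint-checking and constrained-feedback stages can be sketched in a few lines. This is an illustrative toy, not the paper's released code: the rule names, graph schema, and prompt wording are all hypothetical, but it shows the core idea that the VLM is only allowed to rephrase violations the rule engine has already verified.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Violation:
    rule: str
    message: str

def check_constraints(graph):
    """Stage 3 (illustrative): a rule engine flags only violations it can
    verify symbolically on the diagram graph."""
    violations = []
    for arrow in graph.get("arrows", []):
        # Example FBD rule: every force arrow must attach to a body node.
        if arrow.get("attached_to") is None:
            violations.append(Violation(
                "attachment",
                f"Force '{arrow['label']}' is not attached to the body."))
        # Example FBD rule: gravity must point straight down.
        if arrow.get("label") == "gravity" and arrow.get("direction") != "down":
            violations.append(Violation(
                "gravity_direction",
                "Gravity should point straight down."))
    return violations

def build_feedback_prompt(violations):
    """Stage 4 (illustrative): constrain the VLM to verified violations."""
    if not violations:
        return "State that the diagram satisfies all checked rules."
    listed = "\n".join(f"- [{v.rule}] {v.message}" for v in violations)
    return ("Rephrase ONLY the following verified issues as feedback to the "
            "student. Do not mention any other problems:\n" + listed)

# Toy symbolic graph from stages 1-2 (perception + graph construction).
graph = {"arrows": [
    {"label": "gravity", "direction": "left", "attached_to": "block"},
    {"label": "normal", "direction": "up", "attached_to": None},
]}
prompt = build_feedback_prompt(check_constraints(graph))
```

Because the prompt enumerates a closed set of verified issues, the language model has no license to invent errors, which is the mechanism behind the reported hallucination reduction.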
The research also surfaces domain-specific insights: free-body diagram detection proved resilient to noise augmentation, while circuit detection degraded sharply under the same conditions. Most significantly, an LLM-as-judge evaluation confirmed that the grammar pipeline produces substantially more actionable circuit feedback (4.85/5) than end-to-end LMMs (3.11/5). The author has released all code, datasets, and evaluation scripts, enabling further development in this area of educational technology. The work is a step toward trustworthy AI deployment in classroom settings, where accuracy and reliability are paramount.
- Cuts FBD hallucination from 78% (Qwen2-VL-7B) to 32% while maintaining micro-F1 around 0.556; on circuits, Qwen2-VL-7B hallucinates at a 98% rate
- Uses four-stage grammar-in-the-loop pipeline to constrain VLM feedback to verified violations only
- Achieves 4.85/5 rating for actionable circuit feedback versus 3.11/5 for standard LMMs in LLM-as-judge evaluation
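The headline numbers above rest on two metrics. A plausible way to compute them over per-image sets of predicted vs. gold error labels is shown below; these definitions (micro-F1 over label sets, hallucination rate as the fraction of predicted errors absent from the gold annotation) are assumptions for illustration, not taken from the paper's evaluation scripts.

```python
def micro_f1(pred_sets, gold_sets):
    """Micro-averaged F1: pool true/false positives and negatives
    across all images before computing precision and recall."""
    tp = sum(len(p & g) for p, g in zip(pred_sets, gold_sets))
    fp = sum(len(p - g) for p, g in zip(pred_sets, gold_sets))
    fn = sum(len(g - p) for p, g in zip(pred_sets, gold_sets))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def hallucination_rate(pred_sets, gold_sets):
    """Fraction of predicted errors not present in the gold annotation."""
    predicted = sum(len(p) for p in pred_sets)
    spurious = sum(len(p - g) for p, g in zip(pred_sets, gold_sets))
    return spurious / predicted if predicted else 0.0

# Toy example: two images, with one spurious error on the second.
preds = [{"missing_normal"}, {"gravity_up", "extra_force"}]
golds = [{"missing_normal"}, {"gravity_up"}]
```

On this toy data, micro-F1 is 0.8 and the hallucination rate is 1/3, mirroring the trade-off reported in the paper: a model can post a respectable F1 while still hallucinating a large share of its feedback.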
Why It Matters
Enables reliable AI-assisted grading in STEM education by dramatically reducing hallucinations that undermine trust in classroom deployments.