A Multi-Agent Approach to Validate and Refine LLM-Generated Personalized Math Problems
Researchers use four specialized AI agents to catch unrealistic quantities and improve authenticity in personalized math problems.
A research team from Worcester Polytechnic Institute and other institutions has developed a novel multi-agent framework to tackle the reliability issues of using large language models (LLMs) like GPT-4 to create personalized educational content. Published at AIED 2026, the paper addresses a critical problem: while LLMs can efficiently generate math problems tailored to student interests, the initial outputs often suffer from unrealistic quantities, poor readability, limited authenticity to a student's real-life experiences, and even mathematical inconsistencies. The proposed solution formalizes personalization as an iterative generate-validate-revise process.
The core innovation is the deployment of four specialized validator agents, each targeting a specific failure mode: solvability, realism, readability, and authenticity. The system was rigorously tested on 600 problems drawn from the popular online homework platform ASSISTments, personalizing each across 20 different student interest topics. Results showed that authenticity and realism were the most frequent failure points in the raw LLM output, but a single refinement iteration using the multi-agent feedback loop substantially reduced these failures. The study also compared several strategies for aggregating validator feedback into revisions, finding that no single approach dominated: each had strengths on different criteria.
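The paper does not publish its implementation, but the generate-validate-revise loop it describes can be sketched in a few lines of Python. Everything below is an illustrative assumption: the validator logic, the function names, and the `demo_reviser` are crude stand-ins for what would, in the actual system, be LLM-backed agents and a revision prompt.

```python
import re
from typing import Callable, NamedTuple

class Verdict(NamedTuple):
    criterion: str
    passed: bool
    feedback: str

# Placeholder validators for the four criteria named in the paper.
# Real validators would be LLM agents; these string checks only
# illustrate the control flow of the feedback loop.
def check_solvability(p: str) -> Verdict:
    ok = p.strip().endswith("?")
    return Verdict("solvability", ok, "" if ok else "no question is posed")

def check_realism(p: str) -> Verdict:
    nums = [int(n) for n in re.findall(r"\d+", p)]
    ok = all(n <= 1000 for n in nums)  # flag implausibly large quantities
    return Verdict("realism", ok, "" if ok else "quantity seems implausible")

def check_readability(p: str) -> Verdict:
    ok = len(p.split()) <= 60
    return Verdict("readability", ok, "" if ok else "problem text is too long")

def check_authenticity(p: str, interest: str) -> Verdict:
    ok = interest.lower() in p.lower()
    return Verdict("authenticity", ok, "" if ok else f"does not reflect {interest}")

def refine(problem: str, interest: str,
           reviser: Callable[[str, list[Verdict]], str],
           max_iters: int = 3) -> str:
    """Iterate validate-then-revise until all four validators pass."""
    validators = [
        check_solvability, check_realism, check_readability,
        lambda p: check_authenticity(p, interest),
    ]
    for _ in range(max_iters):
        failures = [v for f in validators if not (v := f(problem)).passed]
        if not failures:
            break
        problem = reviser(problem, failures)  # e.g., re-prompt the LLM
    return problem

# Stand-in reviser: a real system would send the aggregated feedback
# back to the generator LLM instead of doing string surgery.
def demo_reviser(problem: str, failures: list[Verdict]) -> str:
    if any(f.criterion == "realism" for f in failures):
        problem = re.sub(r"\d{4,}", "9", problem)
    return problem

draft = ("Sam buys 9999 basketball cards for $2 each. "
         "How many dollars does Sam spend?")
revised = refine(draft, "basketball", demo_reviser)
```

In this toy run, the realism validator rejects the draft's implausible quantity, the reviser rewrites it, and the second pass through all four validators succeeds. The interesting design question the paper studies is precisely the step hand-waved here: how to aggregate feedback from multiple validators into one revision.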
A human evaluation assessed the reliability of the AI validators themselves, revealing that the agents were most reliable at judging realism and least reliable at judging authenticity, underscoring the subjective and personal nature of that criterion. This finding highlights a significant challenge for AI in education: the need for evaluation protocols that better account for the unique characteristics of individual teachers and students. The work represents a major step toward making AI-generated educational content more trustworthy and effective at scale.
- Framework uses four specialized AI validator agents targeting solvability, realism, readability, and authenticity to refine LLM output.
- Tested on 600 problems from ASSISTments; a single refinement iteration substantially reduced failures, especially in authenticity and realism.
- Human eval showed validator reliability was highest on realism and lowest on authenticity, pointing to a need for more personalized evaluation.
Why It Matters
This makes AI-generated educational content more reliable and scalable, moving personalized tutoring from a promising concept to a practical tool.