Assessing the Pedagogical Readiness of Large Language Models as AI Tutors in Low-Resource Contexts: A Case Study of Nepal's K-10 Curriculum
Research reveals a 'curriculum-alignment gap' where top AI models struggle with cultural context and simple explanations.
A team of researchers from Nepal conducted a systematic evaluation of four leading large language models—OpenAI's GPT-4o, Anthropic's Claude Sonnet 4, Qwen3-235B, and Kimi K2—to assess their readiness as AI tutors for Nepal's Grade 5-10 Science and Mathematics curriculum. The study introduced a novel, curriculum-aligned benchmark that broke pedagogical efficacy down into seven binary metrics, including Factual Correctness, Clarity, and Contextual Relevance. The results revealed a stark 'curriculum-alignment gap': even the top performers, GPT-4o and Claude Sonnet 4, despite achieving approximately 97% aggregate reliability, showed significant deficiencies in explaining concepts clearly to novices and in providing culturally relevant examples.
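The article does not give the paper's exact scoring formula, but a minimal sketch can illustrate how seven binary (pass/fail) metrics might roll up into an aggregate reliability figure; here aggregate reliability is assumed to be the overall pass rate across all metric checks, and metric names beyond the three listed above are placeholders:

```python
def aggregate_reliability(gradings: list[dict[str, bool]]) -> float:
    """Fraction of metric checks passed across all graded responses.

    Each element of `gradings` is one tutoring response, graded on
    binary metrics such as Factual Correctness, Clarity, and
    Contextual Relevance (the paper uses seven; three are shown here).
    """
    checks = [passed for grading in gradings for passed in grading.values()]
    return sum(checks) / len(checks)

# Two toy graded responses (illustrative data, not from the study).
gradings = [
    {"factual_correctness": True, "clarity": True, "contextual_relevance": False},
    {"factual_correctness": True, "clarity": False, "contextual_relevance": True},
]
print(round(aggregate_reliability(gradings), 2))  # 0.67
```

A scheme like this explains how a model can score ~97% overall while still failing systematically on one dimension: a consistent miss on a single metric (e.g. Contextual Relevance) is diluted by passes on the other six.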
The research identified two critical failure modes: the 'Expert's Curse,' where models could solve complex problems but failed to explain them in simple terms, and the 'Foundational Fallacy,' where performance paradoxically degraded on simpler, lower-grade material because the models could not adapt to younger learners' cognitive constraints. Furthermore, the regional model Kimi K2 exhibited a 'Contextual Blindspot,' failing to provide culturally relevant examples in over 20% of interactions. These findings challenge the assumption that global AI capabilities can be applied directly to local educational needs without significant adaptation.
Based on these results, the researchers concluded that off-the-shelf LLMs are not yet ready for autonomous deployment in Nepalese classrooms. They propose a 'human-in-the-loop' deployment strategy where AI assists rather than replaces teachers, and offer a methodological blueprint for curriculum-specific fine-tuning. This study provides a crucial framework for evaluating AI educational tools in non-Western, low-resource contexts, emphasizing that technical performance metrics alone are insufficient for real-world educational impact.
- GPT-4o and Claude Sonnet 4 achieved 97% aggregate reliability but still fell short on pedagogical clarity and cultural contextualization.
- Researchers identified the 'Expert's Curse' failure mode where models solve problems but can't explain them clearly to novices.
- Regional model Kimi K2 failed to provide culturally relevant examples in over 20% of interactions, showing a 'Contextual Blindspot'.
Why It Matters
This study reveals that even the most advanced AI models require significant localization and human oversight to be effective educational tools in diverse cultural contexts.