Assessing the Pedagogical Readiness of Large Language Models as AI Tutors in Low-Resource Contexts: A Case Study of Nepal's K-10 Curriculum
Research reveals a 'curriculum-alignment gap' where top AI models struggle with cultural context and simple explanations.
A team of researchers from Nepal conducted a systematic evaluation of four leading large language models—OpenAI's GPT-4o, Anthropic's Claude Sonnet 4, Qwen3-235B, and Kimi K2—to assess their readiness as AI tutors for Nepal's Grade 5-10 Science and Mathematics curriculum. The study introduced a novel, curriculum-aligned benchmark that broke pedagogical efficacy down into seven binary metrics, including Factual Correctness, Clarity, and Contextual Relevance. The results revealed a stark 'curriculum-alignment gap': even the top performers, GPT-4o and Claude Sonnet 4, despite achieving approximately 97% aggregate reliability, showed significant deficiencies in explaining concepts clearly to novices and in providing culturally relevant examples.
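The article does not give the paper's exact scoring formula, but a minimal sketch can illustrate how seven binary (pass/fail) metrics might roll up into an aggregate reliability figure; here aggregate reliability is assumed to be the overall pass rate across all metric checks, and metric names beyond the three listed above are placeholders:

```python
def aggregate_reliability(gradings: list[dict[str, bool]]) -> float:
    """Fraction of metric checks passed across all graded responses.

    Each element of `gradings` is one tutoring response, graded on
    binary metrics such as Factual Correctness, Clarity, and
    Contextual Relevance (the paper uses seven; three are shown here).
    """
    checks = [passed for grading in gradings for passed in grading.values()]
    return sum(checks) / len(checks)

# Two toy graded responses (illustrative data, not from the study).
gradings = [
    {"factual_correctness": True, "clarity": True, "contextual_relevance": False},
    {"factual_correctness": True, "clarity": False, "contextual_relevance": True},
]
print(round(aggregate_reliability(gradings), 2))  # 0.67
```

A scheme like this explains how a model can score ~97% overall while still failing systematically on one dimension: a consistent miss on a single metric (e.g. Contextual Relevance) is diluted by passes on the other six.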
The research identified two critical failure modes: the 'Expert's Curse,' where models could solve complex problems but failed to explain them in simple terms, and the 'Foundational Fallacy,' where performance paradoxically degraded on simpler, lower-grade material because the models could not adapt to younger learners' cognitive constraints. Furthermore, the regional model Kimi K2 exhibited a 'Contextual Blindspot,' failing to provide culturally relevant examples in over 20% of interactions. These findings challenge the assumption that global AI capabilities can be applied directly to local educational needs without significant adaptation.
Based on these results, the researchers concluded that off-the-shelf LLMs are not yet ready for autonomous deployment in Nepalese classrooms. They propose a 'human-in-the-loop' deployment strategy where AI assists rather than replaces teachers, and offer a methodological blueprint for curriculum-specific fine-tuning. This study provides a crucial framework for evaluating AI educational tools in non-Western, low-resource contexts, emphasizing that technical performance metrics alone are insufficient for real-world educational impact.
- GPT-4o and Claude Sonnet 4 achieved 97% aggregate reliability but still fell short on pedagogical clarity and cultural contextualization.
- Researchers identified the 'Expert's Curse' failure mode where models solve problems but can't explain them clearly to novices.
- Regional model Kimi K2 failed to provide culturally relevant examples in over 20% of interactions, showing a 'Contextual Blindspot'.
Why It Matters
This study reveals that even the most advanced AI models require significant localization and human oversight to be effective educational tools in diverse cultural contexts.