Research & Papers

KMMMU: Evaluation of Massive Multi-discipline Multimodal Understanding in Korean Language and Context

New native Korean benchmark shows top AI models struggle with local conventions, scoring just 42–52% accuracy.

Deep Dive

A research team led by Nahyun Lee has introduced KMMMU, a groundbreaking benchmark designed to evaluate multimodal AI understanding specifically within Korean language and cultural contexts. Unlike translated or English-centric benchmarks, KMMMU contains 3,466 questions natively written in Korean, drawn from real exams across nine academic disciplines including medicine, law, and engineering. The benchmark features nine visual modality categories and includes specialized subsets: 300 Korean-specific questions and a particularly challenging 627-question hard subset. This native approach targets information-dense problems shaped by local conventions, official standards, and discipline-specific visual formats that global models typically struggle with.
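
To make the benchmark's structure concrete, here is a minimal sketch of how a KMMMU-style multiple-choice evaluation might be scored. The field names, subset labels, and model interface below are illustrative assumptions, not confirmed details of the KMMMU release:

    # Hypothetical scoring loop for a KMMMU-style multimodal benchmark.
    # Field names, subset labels, and the model interface are illustrative
    # assumptions; consult the official KMMMU release for the real schema.
    from dataclasses import dataclass

    @dataclass
    class Question:
        image_path: str      # discipline-specific figure, chart, or form
        prompt: str          # question text, natively written in Korean
        options: list[str]   # multiple-choice options
        answer: str          # gold label, e.g. "A"
        discipline: str      # one of the nine disciplines (medicine, law, ...)
        subset: str          # e.g. "full", "korean_specific", or "hard"

    def accuracy_by_subset(model, questions: list[Question]) -> dict[str, float]:
        """Score a model exposing predict(image, prompt, options) -> label."""
        correct: dict[str, int] = {}
        total: dict[str, int] = {}
        for q in questions:
            pred = model.predict(image=q.image_path, prompt=q.prompt,
                                 options=q.options)
            total[q.subset] = total.get(q.subset, 0) + 1
            if pred == q.answer:
                correct[q.subset] = correct.get(q.subset, 0) + 1
        return {s: correct.get(s, 0) / n for s, n in total.items()}

Reporting accuracy per subset rather than as a single aggregate is what surfaces the full-versus-hard and general-versus-Korean-specific gaps discussed next.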

The results reveal significant performance gaps in current AI systems. The strongest open-source model achieved only 42.05% accuracy on the full benchmark, while the best proprietary model managed just 52.42% on the hard subset. Performance varied dramatically across disciplines, with certain fields emerging as particular bottlenecks. Korean-specific questions showed an accuracy gap of up to 13.43 percentage points relative to more general questions, highlighting models' weaknesses in localized knowledge. Error analysis suggests these failures stem not from insufficient reasoning depth but from weaknesses in convention-to-label mapping, few-shot symbolic induction, localized knowledge recall, and domain-specific standards understanding.
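
For clarity on how that gap figure is expressed, it is a simple difference of subset accuracies reported in percentage points. The two component accuracies below are hypothetical placeholders, chosen only to reproduce the stated 13.43-point gap:

    # Illustrative accuracy-gap computation. The component accuracies are
    # made-up placeholders; only the 13.43-point gap comes from the paper.
    general_acc = 0.5542          # hypothetical accuracy on general questions
    korean_specific_acc = 0.4199  # hypothetical accuracy on Korean-specific ones

    gap_pp = (general_acc - korean_specific_acc) * 100
    print(f"Korean-specific accuracy gap: {gap_pp:.2f} percentage points")  # 13.43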

KMMMU provides a crucial testbed for developing more reliable AI systems for expert real-world tasks in non-English contexts. The benchmark's focus on native Korean materials, rather than translations, exposes fundamental limitations in how current multimodal models process culturally embedded information. This work challenges the AI community to move beyond English-centric evaluation and to build systems that genuinely understand local conventions and institutional knowledge. That understanding is essential for deploying AI in professional settings across different linguistic and cultural environments.

Key Points
  • KMMMU contains 3,466 native Korean questions across 9 disciplines with specialized visual formats
  • Best proprietary model scored only 52.42% on the hard subset; the best open-source model reached 42.05% on the full benchmark
  • Korean-specific questions showed an accuracy gap of up to 13.43 percentage points, revealing weaknesses in cultural and institutional knowledge

Why It Matters

Exposes critical gaps in AI's understanding of non-English cultural contexts, essential for global professional deployment.