Research & Papers

GAIA-v2-LILT: Multilingual Adaptation of Agent Benchmark beyond Translation

English-centric agent benchmarks may be causing performance drops of up to 32.7% in other languages.

Deep Dive

Researchers led by Yunsu Kim have released GAIA-v2-LILT, a re-audited multilingual extension of the GAIA agent benchmark covering five non-English languages. The paper, submitted to arXiv on April 27, 2026, argues that existing multilingual benchmarks lean too heavily on machine translation (MT) with minimal post-editing, which undermines validity for agentic tasks through query-answer misalignment and culturally off-target context. The proposed workflow introduces explicit functional alignment, cultural alignment, and difficulty calibration, combining automated checks with human review.
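
To make the audit loop concrete, here is a minimal Python sketch of what the automated stage might look like before flagged items reach human review. The item schema, the check names, and the flag logic are illustrative assumptions, not the released MAPS tooling.

    from dataclasses import dataclass

    # Hypothetical item schema; the released GAIA-v2-LILT/MAPS format may differ.
    @dataclass
    class Item:
        query: str        # task prompt in the target language
        gold_answer: str  # expected final answer in the target language
        en_answer: str    # gold answer of the English original
        language: str     # target language code, e.g. "de"

    def functional_flags(item: Item) -> list[str]:
        """Automated check: does the adapted query still pair with the same answer?"""
        flags = []
        if not item.query.strip() or not item.gold_answer.strip():
            flags.append("empty field after adaptation")
        # Naive surface comparison; a real check might use an LLM judge
        # or re-run the task end to end.
        if item.gold_answer.strip().lower() != item.en_answer.strip().lower():
            flags.append("answer drifted from English gold (possible misalignment)")
        return flags

    def cultural_flags(item: Item, anchored_entities: set[str]) -> list[str]:
        """Automated check: flag English-anchored entities left in the query."""
        return [f"culturally off-target entity: {e}"
                for e in anchored_entities if e in item.query]

    def audit(items: list[Item], anchored: set[str]) -> list[tuple[Item, list[str]]]:
        """Route every flagged item to the human-review queue."""
        return [(it, fl) for it in items
                if (fl := functional_flags(it) + cultural_flags(it, anchored))]

In practice, the functional check would need to tolerate legitimately localized answers (dates, units, entity names), which is presumably where difficulty calibration and human review come in.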

In experiments, the GAIA-v2-LILT workflow improved agent success rates by up to 32.7% over minimally translated versions, bringing the closest audited setting to within 3.1% of English performance. Substantial gaps remain in other settings, but the size of the recovery indicates that a significant portion of the multilingual performance gap measured by prior benchmarks is benchmark-induced measurement error rather than a model limitation. The data is released as part of the MAPS package, and the code is publicly available. The work motivates task-level alignment when adapting English benchmarks across languages, providing a more accurate foundation for evaluating multilingual AI agents.
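
Read as percentage points within a single setting, the reported numbers imply a simple decomposition of the gap; the sketch below spells it out with a placeholder English success rate, since the paper's absolute scores are not quoted here.

    # Illustrative decomposition (placeholder baseline; assumes the reported
    # "up to 32.7%" and "within 3.1%" refer to the same setting, in points).
    english = 60.0                          # hypothetical English success rate (%)
    mt_baseline = english - (32.7 + 3.1)    # minimally translated version
    audited = mt_baseline + 32.7            # after the GAIA-v2-LILT audit

    benchmark_induced = audited - mt_baseline  # recovered by better adaptation
    residual = english - audited               # plausibly a real model gap

    print(f"benchmark-induced portion: {benchmark_induced:.1f} points")  # 32.7
    print(f"residual gap vs English:   {residual:.1f} points")           # 3.1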

Key Points
  • GAIA-v2-LILT covers five non-English languages with functional, cultural, and difficulty calibration beyond simple translation.
  • Improved agent success rates by up to 32.7% over minimally translated versions, closing the gap to within 3.1% of English in the best case.
  • Highlights that multilingual performance gaps are often benchmark-induced measurement errors, not model limitations.

Why It Matters

This benchmark fix helps ensure multilingual AI agents are evaluated on their actual capabilities rather than penalized by poor translations.