Research & Papers

INDOTABVQA: A Benchmark for Cross-Lingual Table Understanding in Bahasa Indonesia Documents

A new 1,593-image benchmark tests Vision-Language Models on Bahasa Indonesia tables with questions in 4 languages, exposing major performance gaps.

Deep Dive

A team of researchers has introduced INDOTABVQA, a new benchmark designed to rigorously test how well Vision-Language Models (VLMs) understand tables within real-world Indonesian (Bahasa Indonesia) documents. The dataset comprises 1,593 document images featuring tables in three distinct visual styles (bordered, borderless, and colorful) and pairs them with 1,593 question-answer sets spanning four languages: Bahasa Indonesia, English, Hindi, and Arabic. This structure supports evaluation in both monolingual and more challenging cross-lingual scenarios, where a model must extract information from a table in one language to answer a question posed in another.
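To make the cross-lingual setup concrete, here is a minimal Python sketch of how one benchmark example might be represented and evaluated. The field names, sample values, and the `vlm.answer` call are illustrative assumptions, not the dataset's actual schema or API.

```python
from dataclasses import dataclass, field

@dataclass
class TableVQAExample:
    """One document image plus multilingual QA annotations (hypothetical schema)."""
    image_path: str          # document image containing the table
    table_style: str         # "bordered", "borderless", or "colorful"
    questions: dict = field(default_factory=dict)  # language code -> question text
    answers: dict = field(default_factory=dict)    # language code -> gold answer

example = TableVQAExample(
    image_path="docs/report_0421.png",             # hypothetical file
    table_style="borderless",
    questions={
        "id": "Berapa total pengeluaran pada tahun 2022?",
        "en": "What was the total expenditure in 2022?",
    },
    answers={"id": "Rp1,2 miliar", "en": "IDR 1.2 billion"},
)

# Monolingual case: an Indonesian question over the Indonesian table.
# Cross-lingual case: e.g., an English question over the same Indonesian table.
for lang, question in example.questions.items():
    pass  # prediction = vlm.answer(example.image_path, question)  # model-specific
```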

Benchmarking results on leading models, including the proprietary GPT-4o and open-source options such as Qwen2.5-VL, Gemma-3, and LLaMA-3.2, revealed significant performance gaps. Models struggled particularly with structurally complex tables and with questions in lower-resource languages such as Hindi and Arabic. The research also demonstrated that targeted fine-tuning is highly effective: a compact 3B-parameter model saw an 11.6% accuracy boost, while a LoRA-finetuned 7B model improved by 17.8%. Furthermore, providing explicit spatial coordinates of table regions as input yielded an additional 4-7% gain, highlighting the importance of structural awareness for this task.
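The spatial-prior result is easy to picture in code. Below is a hedged sketch of prepending a table's bounding box to the question before querying a VLM; the coordinate format and the `vlm_answer` helper are assumptions, since the paper's exact prompt template is not reproduced here.

```python
def build_prompt(question: str, table_bbox=None) -> str:
    """Optionally prepend an explicit table region to the question.

    table_bbox: (x0, y0, x1, y1) in normalized [0, 1] image coordinates.
    Whether the paper uses normalized or pixel coordinates is an assumption.
    """
    if table_bbox is None:
        return question  # baseline condition: question only
    x0, y0, x1, y1 = table_bbox
    return (
        f"The table occupies the region ({x0:.2f}, {y0:.2f}) to "
        f"({x1:.2f}, {y1:.2f}) of the image. Answer using only that table.\n"
        f"{question}"
    )

# The two conditions compared in the benchmark:
# plain   = build_prompt("What was the total expenditure in 2022?")
# spatial = build_prompt("What was the total expenditure in 2022?",
#                        table_bbox=(0.08, 0.35, 0.92, 0.78))
# answer  = vlm_answer(image, spatial)  # vlm_answer is a hypothetical helper
```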

The findings underscore a critical need for language-diverse and domain-specific datasets to advance global AI capabilities. INDOTABVQA fills a major gap for underrepresented regions and document types, providing a valuable resource to push research in cross-lingual, structure-aware document understanding. By exposing current model weaknesses and proving that targeted adaptation works, this benchmark paves the way for more robust AI tools capable of processing the world's diverse informational formats.

Key Points
  • Dataset includes 1,593 Bahasa Indonesia document images with tables and QA pairs in 4 languages (Indonesian, English, Hindi, Arabic).
  • Benchmarking showed major performance gaps in VLMs like GPT-4o, with fine-tuning yielding up to 17.8% accuracy improvements (see the LoRA sketch after this list).
  • Providing explicit table coordinates as spatial priors boosted performance by 4-7%, showing that structural awareness is key for table VQA.
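As a rough illustration of the fine-tuning result above, here is a minimal LoRA setup using Hugging Face `peft`. The model ID, target modules, and hyperparameters are assumptions chosen for illustration; the paper's exact recipe may differ.

```python
from transformers import Qwen2_5_VLForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Load a 7B-class open VLM (one of the model families evaluated in the paper).
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct"
)

lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update (assumed)
    lora_alpha=32,                        # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections only (assumed)
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of the 7B weights train

# Training then proceeds with a standard supervised loop over the
# image + question -> answer pairs from the benchmark's training split.
```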

Why It Matters

The benchmark exposes AI's current weakness in global document understanding and provides a roadmap for building better multilingual tools for processing business and government data.