AI Safety

Open-weight AI models show geographic bias in governance benchmarks

arXiv cs.CY June 26, 2026

⚡ New study benchmarks open-weight LLMs against a 24k+ indicator dataset and finds lower accuracy for underrepresented countries

Deep Dive

A new preprint by Jason Hung tackles a critical flaw in AI governance: geographic bias in large language models. Prior studies of this issue suffered from three methodological weaknesses: reliance on proprietary (closed-weight) systems that cannot be independently replicated, evaluation of model knowledge for years that fall after the models' training data cutoff, and a coarse binary response classification that conflates confident fabrication with honest uncertainty. Hung addresses all three by benchmarking four open-weight frontier language models (models whose weights are publicly available) against the Global AI Dataset v2 (GAID v2), a verified ground-truth database of 24,453 indicators spanning 227 countries, published on Harvard Dataverse in January 2026.

Using 18 indicators mapped to the eight thematic dimensions of the IEEE IRAI 2026 framework, Hung constructs approximately 2,990 country-metric-year observations across six evaluation years between 2010 and 2023. Model responses are classified into five categories: verified accuracy (VA), confident fabrication, i.e. hallucination (HF), honest refusal (HR), qualitative hedging (QH), and misattribution (MF). Geographic disparities in accuracy are estimated via mixed-effects logistic regression and difference-in-differences (DiD) analysis. The findings underscore that open-weight models still exhibit significant geographic bias, but the transparent methodology enables reproducible audits, a crucial step toward fair global AI governance.
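To make the five-way labelling and the difference-in-differences idea concrete, here is a minimal Python sketch. It is not the preprint's rubric or code: the decision rules, the 5% tolerance, the `right_entity` flag, the group and period names, and all numbers are assumptions invented for illustration. The plain 2x2 contrast also stands in for the regression-based estimates the study reports, and no mixed-effects model is fitted here.

```python
from enum import Enum
from typing import Dict, List, Optional, Tuple


class Category(Enum):
    VERIFIED_ACCURACY = "verified accuracy"
    CONFIDENT_FABRICATION = "confident fabrication (hallucination)"
    HONEST_REFUSAL = "honest refusal"
    QUALITATIVE_HEDGING = "qualitative hedging"
    MISATTRIBUTION = "misattribution"


def classify(answer: Optional[float], truth: float, *, refused: bool = False,
             right_entity: bool = True, rel_tol: float = 0.05) -> Category:
    """Toy decision rules (hypothetical, not the paper's rubric)."""
    if refused:
        return Category.HONEST_REFUSAL
    if answer is None:
        # The model replied in words only, without committing to a number.
        return Category.QUALITATIVE_HEDGING
    if not right_entity:
        # A number was given but tied to the wrong country, year, or source.
        return Category.MISATTRIBUTION
    if abs(answer - truth) <= rel_tol * abs(truth):
        return Category.VERIFIED_ACCURACY
    return Category.CONFIDENT_FABRICATION


def accuracy(labels: List[Category]) -> float:
    """Share of responses labelled as verified accuracy."""
    return sum(lab is Category.VERIFIED_ACCURACY for lab in labels) / len(labels)


def diff_in_diff(acc: Dict[Tuple[str, str], float]) -> float:
    """Change in accuracy for group_a minus the change for group_b."""
    change_a = acc[("group_a", "late")] - acc[("group_a", "early")]
    change_b = acc[("group_b", "late")] - acc[("group_b", "early")]
    return change_a - change_b


# Five made-up responses to an indicator whose true value is 40.0.
labels = [
    classify(41.0, 40.0),                      # within 5% of the truth
    classify(90.0, 40.0),                      # confidently wrong
    classify(None, 40.0, refused=True),        # declined to answer
    classify(None, 40.0),                      # hedged in words only
    classify(40.0, 40.0, right_entity=False),  # right number, wrong entity
]

# Made-up accuracy rates for two country groups in two periods.
toy_acc = {
    ("group_a", "early"): 0.50, ("group_a", "late"): 0.25,
    ("group_b", "early"): 0.75, ("group_b", "late"): 0.75,
}
did_estimate = diff_in_diff(toy_acc)
```

A binary correct/incorrect scheme would merge the last four labels into one "wrong" bucket, which is the conflation of fabrication with honest uncertainty that the five categories are meant to avoid. In the toy table, accuracy for group_a falls while group_b stays flat, so the difference-in-differences value is negative.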

Key Points

Benchmarked four open-weight frontier LLMs using GAID v2 with 24,453 indicators across 227 countries
Introduced a five-category response classification (VA, HF, HR, QH, MF) replacing coarse binary labels
Found significant geographic disparities in accuracy via mixed-effects logistic regression and DiD analysis

Why It Matters

Geographic bias in LLMs undermines equitable AI governance; open-weight benchmarking enables reproducible audits and accountability.
