AI Safety

Open-weight AI models show geographic bias in governance benchmarks

⚡ New study finds open-weight LLMs less accurate for underrepresented countries using 24k+ indicators

Deep Dive

A new preprint by Jason Hung tackles a critical flaw in AI governance: geographic bias in large language models. Prior studies on this issue suffered from three methodological weaknesses: reliance on proprietary (closed-weight) systems that cannot be independently replicated, evaluation of model knowledge about years that fall after the models' training data cutoff, and a coarse binary response classification that conflates confident fabrication with honest uncertainty. Hung addresses all three by benchmarking four open-weight frontier language models (models whose weights are publicly available) against the Global AI Dataset v2 (GAID v2), a verified ground-truth database of 24,453 indicators spanning 227 countries, published on Harvard Dataverse in January 2026.

Using 18 indicators mapped to the eight thematic dimensions of the IEEE IRAI 2026 framework, Hung creates approximately 2,990 country-metric-year observations across six evaluation years (2010–2023). Model responses are classified into five categories: verified accuracy, confident fabrication (hallucination), honest refusal, qualitative hedging, and misattribution. Geographic disparities in accuracy are estimated via mixed-effects logistic regression and difference-in-differences analysis. The findings underscore that open-weight models still exhibit significant geographic bias, but the transparent methodology enables reproducible audits, a crucial step for fair global AI governance.
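To make the five-category scheme concrete, here is a minimal Python sketch of how already-labeled responses might be tallied per country group. It is an illustration only and not the preprint's code: the names `ResponseCategory` and `category_shares`, the group labels, and the sample observations are all hypothetical, and the sketch does not reproduce the paper's labeling rules, the mixed-effects regression, or the difference-in-differences analysis.

```python
from collections import Counter, defaultdict
from enum import Enum


class ResponseCategory(Enum):
    """The five response categories named in the summary (identifiers are hypothetical)."""
    VERIFIED_ACCURACY = "verified accuracy"
    CONFIDENT_FABRICATION = "confident fabrication"
    HONEST_REFUSAL = "honest refusal"
    QUALITATIVE_HEDGING = "qualitative hedging"
    MISATTRIBUTION = "misattribution"


def category_shares(observations):
    """Return, for each group, the share of responses in every category.

    `observations` is an iterable of (group, ResponseCategory) pairs that
    have already been labeled; how labels are assigned is out of scope here.
    """
    counts = defaultdict(Counter)
    for group, category in observations:
        counts[group][category] += 1

    shares = {}
    for group, counter in counts.items():
        total = sum(counter.values())
        # Counter returns 0 for categories that never occurred in this group.
        shares[group] = {cat: counter[cat] / total for cat in ResponseCategory}
    return shares


# Made-up observations for two hypothetical country groups (not real results).
sample = [
    ("group_a", ResponseCategory.VERIFIED_ACCURACY),
    ("group_a", ResponseCategory.VERIFIED_ACCURACY),
    ("group_a", ResponseCategory.HONEST_REFUSAL),
    ("group_a", ResponseCategory.CONFIDENT_FABRICATION),
    ("group_b", ResponseCategory.VERIFIED_ACCURACY),
    ("group_b", ResponseCategory.CONFIDENT_FABRICATION),
    ("group_b", ResponseCategory.CONFIDENT_FABRICATION),
    ("group_b", ResponseCategory.MISATTRIBUTION),
]

shares = category_shares(sample)
for group, by_category in shares.items():
    print(group, {cat.value: round(share, 2) for cat, share in by_category.items()})
```

Keeping fabrication, refusal, hedging, and misattribution as separate categories is what lets a tally like this distinguish a model that declines to answer from one that answers wrongly with confidence, which a binary correct/incorrect label would merge.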

Key Points
  • Benchmarked four open-weight frontier LLMs using GAID v2 with 24,453 indicators across 227 countries
  • Introduced a five-category response classification (VA, HF, HR, QH, MF) replacing coarse binary labels
  • Found significant geographic disparities in accuracy via mixed-effects logistic regression and DiD analysis

Why It Matters

Geographic bias in LLMs undermines equitable AI governance; open-weight benchmarking enables reproducible audits and accountability.
