AI Safety

GPT, Claude, DeepSeek, Grok All Show Systematic Budget Bias vs OECD

New study finds LLMs under-allocate pensions by a factor of nearly three and over-allocate housing by a factor of four relative to OECD reference budgets.

Deep Dive

A new arXiv paper tests four major LLMs (GPT-4o, Claude, DeepSeek, and Grok) on a budgeting task: distributing a fixed social budget across 12 macro-areas for Spain and for Germany. Each model was queried six times per country with identical prompts, producing 48 allocations (4 models × 2 countries × 6 runs) that were analyzed against OECD reference budgets. The results reveal a systematic implicit social policy that diverges sharply from actual European spending. Pensions are under-allocated by a factor of nearly three, while housing and employment are over-allocated by factors of four and two, respectively.
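The "factor of" figures above are ratios between a model's share for an area and the reference share for the same area. The following is a minimal sketch of that arithmetic, not the paper's code: the shares in `reference` and `model_mean` are hypothetical placeholders chosen only to illustrate the calculation, and the function name `allocation_ratios` is an assumption.

```python
# Illustrative only: all shares below are hypothetical placeholders,
# NOT figures taken from the paper or from OECD data.

# Number of allocations described in the study design.
N_MODELS, N_COUNTRIES, N_RUNS = 4, 2, 6
total_allocations = N_MODELS * N_COUNTRIES * N_RUNS

# Hypothetical budget shares in percent for three of the macro-areas.
reference = {"pensions": 45.0, "housing": 2.5, "employment": 5.0}
model_mean = {"pensions": 15.0, "housing": 10.0, "employment": 10.0}


def allocation_ratios(model, ref):
    """Return model share divided by reference share for each area."""
    return {area: model[area] / ref[area] for area in ref}


ratios = allocation_ratios(model_mean, reference)

print(f"allocations analyzed: {total_allocations}")
for area, ratio in ratios.items():
    if ratio >= 1:
        print(f"{area}: over-allocated by a factor of {ratio:.1f}")
    else:
        # A ratio below 1 is reported as its inverse, e.g. 1/3 -> "factor of 3".
        print(f"{area}: under-allocated by a factor of {1 / ratio:.1f}")
```

A ratio above 1 means the model gives the area a larger share than the reference budget does; a ratio below 1 means a smaller share.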

Surprisingly, the main axis of differentiation among the models is not geopolitical bias: Claude and DeepSeek were the most correlated pair across both countries. Instead, the models differ in whether they concentrate the budget in a few areas or disperse it across many. Only Claude showed substantive sensitivity to national context. The authors conclude that LLMs can support but not replace expert deliberation in public budgeting, a cautionary finding for governments exploring AI for policy decisions.
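Both ideas in this paragraph, how closely two models' allocations track each other and how concentrated a single allocation is, can be quantified in a few lines. The sketch below is an assumption about how one might do so, not the paper's stated method: it uses Pearson correlation for similarity and the Herfindahl-Hirschman index (sum of squared shares) for concentration, and all vectors are hypothetical.

```python
from math import sqrt

# Illustrative only: the allocation vectors are hypothetical, and the choice
# of Pearson correlation and the Herfindahl-Hirschman index is an assumption.


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def hhi(shares):
    """Herfindahl-Hirschman index: 1/n for an even split, 1.0 for one area."""
    total = sum(shares)
    return sum((s / total) ** 2 for s in shares)


# Hypothetical budget shares over four areas (each vector sums to 1).
model_a = [0.40, 0.30, 0.20, 0.10]
model_b = [0.35, 0.30, 0.25, 0.10]
concentrated = [0.70, 0.10, 0.10, 0.10]
dispersed = [0.25, 0.25, 0.25, 0.25]

print(f"correlation(model_a, model_b) = {pearson(model_a, model_b):.3f}")
print(f"HHI concentrated = {hhi(concentrated):.2f}")
print(f"HHI dispersed    = {hhi(dispersed):.2f}")
```

A higher index means more of the budget sits in fewer areas. Note that `pearson` is undefined for a constant vector such as `dispersed`, so it is only applied to the two non-constant vectors here.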

Key Points
  • All four models underallocate pensions by ~3x and overallocate housing by 4x vs OECD benchmarks.
  • Claude and DeepSeek were the most correlated pair across both countries, so the models do not split along geopolitical lines.
  • Only Claude demonstrated meaningful sensitivity to differences between Spain and Germany.

Why It Matters

LLMs exhibit systematic budget biases that could mislead fiscal policy if the models are used without human oversight.