New impossibility proof shows smooth AI scoring can't elicit truth
Lovén and Tarkoma prove any smooth scoring rule leads to agent miscalibration
In a new paper on arXiv, Lauri Lovén and Sasu Tarkoma (University of Helsinki) tackle a core problem in scalable AI oversight: eliciting truthful reports from autonomous agents that also benefit from the report through non-accuracy channels like approval for action or resource allocation. They prove an impossibility result: when the principal uses a strictly proper scoring rule (e.g., Brier, logarithmic) and a non-affine approval function to screen agent types, truthful reporting becomes suboptimal whenever deviation is undetectable. This endogeneity of miscalibration holds for all strictly proper scoring rules, with a closed-form perturbation formula quantifying the bias. The key insight is that any smooth (C¹) oversight function unavoidably distorts the agent's incentives away from calibration.
The good news: the paper offers a constructive escape. A step-function approval threshold—where approval is either fully granted or denied based on a type cutoff—achieves first-best screening for every strictly proper scoring rule. Under the Brier score, the agent's binary inflate-or-not choice makes the type-space threshold independent of the scoring function's curvature, and the authors prove that the Brier score is uniquely optimal: for any non-Brier rule, the welfare gap under smooth oversight is bounded below by Ω(Var(1/G'')(γ/β)²). The paper develops the framework for two domains: AI agent oversight and marketplace mechanism design. The message for AI alignment is direct: smooth scoring-based oversight cannot elicit truth from strategic agents; sharp, step-function thresholds are the only calibration-preserving design.
- Impossibility holds for all strictly proper scoring rules (Brier, log, etc.) when approval function is non-affine
- A step-function approval threshold escapes the impossibility and achieves first-best screening
- Brier score is uniquely optimal: welfare gap under smooth oversight is Ω(Var(1/G'')(γ/β)²) for any non-Brier rule
Why It Matters
For AI safety and oversight design: smooth rewards will fail; threshold-based approval is the only reliable calibration tool.