arXiv paper reveals optimal AI audit design under strategic developer response
How to outsmart AI developers who game privacy audits? This paper has the math.
Regulatory audits of AI systems increasingly rely on differential privacy (DP) to shield training data and model internals. But what happens when the audited developer can strategically adapt their mitigation efforts in response to the audit's privacy constraints? A new paper from Florian Burnat on arXiv tackles exactly this problem, formalizing the auditor-developer interaction as a bilevel Stackelberg game. In this model, the auditor first commits to a query policy and DP budget allocation across harm dimensions, and the developer then reallocates mitigation efforts to minimize detected harm while staying within compliance.
The paper introduces a critical metric: the welfare-weighted under-detection gap B_w, which measures the true residual harm the audit fails to catch when the developer plays optimally. Burnat proves that naive DP auditing, whether the budget is allocated uniformly or in proportion to harm, yields a strictly larger B_w than a non-strategic mitigation baseline whenever three conditions hold: effective detectability varies across harm dimensions, welfare weights are not aligned with detectability, and the developer's optimal response is interior. This is a striking result: the standard approach to privacy budgeting can backfire under strategic behavior.
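To make the structure concrete, one way to write such a bilevel problem is sketched below. The notation is an assumption made here for illustration, not taken from the paper: \varepsilon_k is the DP budget for harm dimension k (with total budget E), q_k is the audit miss-probability, r_k is residual harm after mitigation m_k, c_k is the mitigation cost, and w_k is the welfare weight. The paper's actual definitions may differ.

```latex
\begin{aligned}
\text{(auditor)}\quad
  & \min_{\varepsilon \ge 0,\ \sum_k \varepsilon_k = E}\;
    B_w(\varepsilon) \;=\; \sum_{k=1}^{K} w_k \, q_k(\varepsilon_k)\,
    r_k\!\big(m_k^{*}(\varepsilon)\big) \\
\text{(developer)}\quad
  & m^{*}(\varepsilon) \in \arg\min_{m \ge 0}\;
    \sum_{k=1}^{K} \big(1 - q_k(\varepsilon_k)\big)\, r_k(m_k)
    \;+\; \sum_{k=1}^{K} c_k(m_k)
\end{aligned}
```

In this reading, the developer only pays for harm the audit is likely to detect, while the auditor's objective weights the harm that goes undetected. That asymmetry is why the budget allocation matters.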
The optimal auditor allocation, Burnat shows, balances four factors: welfare weight, audit miss-probability, detectability elasticity, and mitigation-cost curvature. To compute this in practice, the paper proposes Strategic Private Audit Design (SPAD), a projected-gradient algorithm that uses hypergradients computed through the developer's best response. The framework gives regulators a principled way to design audit interfaces that are robust to gaming, ensuring that privacy protections don't inadvertently shield harmful behavior.
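The idea of a projected-gradient loop driven by hypergradients can be illustrated with a small toy model. The sketch below is not the paper's SPAD implementation: the functional forms (miss-probability exp(-d_k * eps_k), linear residual harm, quadratic mitigation cost), the parameter values, and all function names are hypothetical choices made so that the developer's best response has a closed form.

```python
import math

# Hypothetical toy parameters for K = 3 harm dimensions (not from the paper).
W = [3.0, 1.0, 1.0]  # welfare weights w_k
D = [0.5, 2.0, 1.0]  # detectability d_k: miss-probability is exp(-d_k * eps_k)
H = [1.0, 1.0, 1.0]  # unmitigated harm h_k
C = [2.0, 2.0, 2.0]  # mitigation-cost curvature c_k (cost = c_k * m_k**2 / 2)


def project_to_simplex(v, total):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = total}."""
    cumulative, theta = 0.0, 0.0
    for i, u_i in enumerate(sorted(v, reverse=True), start=1):
        cumulative += u_i
        t = (cumulative - total) / i
        if u_i - t > 0:
            theta = t  # threshold from the largest index that stays positive
    return [max(x - theta, 0.0) for x in v]


def best_response(eps):
    """Developer minimizes sum_k p_k*(h_k - m_k) + c_k*m_k**2/2, so m_k = p_k/c_k."""
    response = []
    for e, d, h, c in zip(eps, D, H, C):
        p = 1.0 - math.exp(-d * e)      # detection probability
        response.append(min(h, p / c))  # cannot mitigate more than the harm
    return response


def gap(eps):
    """Welfare-weighted undetected residual harm under the best response."""
    m = best_response(eps)
    return sum(w * math.exp(-d * e) * (h - m_k)
               for w, d, e, h, m_k in zip(W, D, eps, H, m))


def hypergradient(eps):
    """Total derivative of gap w.r.t. eps, including the response term."""
    grad = []
    for w, d, h, c, e, m_k in zip(W, D, H, C, eps, best_response(eps)):
        q = math.exp(-d * e)                      # miss-probability
        dq = -d * q                               # dq/deps
        dm = (d * q / c) if m_k < h else 0.0      # dm*/deps (0 if clipped)
        direct = w * dq * (h - m_k)               # effect through detection
        indirect = -w * q * dm                    # effect through mitigation
        grad.append(direct + indirect)
    return grad


def allocate_budget(total, steps=500, lr=0.1):
    """Projected gradient descent from the uniform allocation; keeps the best iterate."""
    eps = [total / len(W)] * len(W)
    best_eps, best_gap = eps, gap(eps)
    for _ in range(steps):
        g = hypergradient(eps)
        eps = project_to_simplex([e - lr * g_k for e, g_k in zip(eps, g)], total)
        current = gap(eps)
        if current < best_gap:
            best_eps, best_gap = eps, current
    return best_eps, best_gap
```

Calling `allocate_budget(3.0)` returns an allocation and its gap, which can be compared against `gap([1.0, 1.0, 1.0])` for the uniform split. Because the best iterate is kept, the returned gap is never worse than the uniform starting point in this toy setting. With a best response that lacks a closed form, the `dm` term would typically come from implicit differentiation or automatic differentiation instead.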
- The paper models AI audits as a bilevel Stackelberg game where the developer strategically reallocates mitigation in response to the auditor's DP budget allocation.
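- Naive DP allocation (uniform or harm-proportional) yields a strictly larger welfare-weighted under-detection gap than a non-strategic baseline whenever detectability varies across harm dimensions, welfare weights are not aligned with detectability, and the developer's optimal response is interior.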
- The optimal budget allocation balances four factors: welfare weight, audit miss-probability, detectability elasticity, and mitigation-cost curvature. The SPAD algorithm computes this allocation in practice.
Why It Matters
As AI regulation tightens, this work gives regulators a principled way to allocate DP audit budgets so that privacy protections do not end up shielding harmful behavior from developers who game the audit.