Multi-Task PPI Framework Slashes Label Needs for AI Audits
Cross-task recalibration tightens confidence intervals when ground truth is scarce...
Many AI evaluations and social science surveys suffer from label scarcity: each task (e.g., each prompt or survey question) needs high-quality ground-truth labels, but collecting them is expensive. Prediction-powered inference (PPI) traditionally treats tasks independently, wasting the structure that related tasks share. A new paper by Emmenegger, Stahler, and Podimata introduces a multi-task PPI framework that pools labeled data across similar tasks via cross-task recalibration. The key insight is that nonlinear recalibration can exploit common patterns in the relationship between proxy and ground truth, while affine (linear) recalibration offers no asymptotic benefit over single-task power-tuned PPI. As a result, the framework can produce narrower confidence intervals and more accurate estimates when only a handful of labels per task are available.
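To make the single-task baseline concrete, here is a minimal sketch of power-tuned PPI for a mean, the kind of estimator the multi-task framework is compared against. It is an illustration rather than the paper's code: the function name `power_tuned_ppi_mean`, the synthetic data, and the noise level of the proxy are all assumptions made for the demo, and the cross-task recalibration step itself is not shown.

```python
import numpy as np
from statistics import NormalDist


def power_tuned_ppi_mean(y_lab, f_lab, f_unlab, alpha=0.05):
    """Single-task PPI estimate of E[Y] with a data-driven tuning weight.

    y_lab   : ground-truth labels on the small labeled set (size n)
    f_lab   : proxy predictions on the same labeled set (size n)
    f_unlab : proxy predictions on the large unlabeled set (size N)
    """
    y_lab = np.asarray(y_lab, dtype=float)
    f_lab = np.asarray(f_lab, dtype=float)
    f_unlab = np.asarray(f_unlab, dtype=float)
    n, N = len(y_lab), len(f_unlab)

    # Tuning weight: lambda = Cov(Y, f) / ((1 + n/N) * Var(f)).
    f_pool = np.concatenate([f_lab, f_unlab])
    cov_yf = np.cov(y_lab, f_lab, ddof=1)[0, 1]
    lam = cov_yf / ((1 + n / N) * np.var(f_pool, ddof=1))

    # Proxy mean on unlabeled data plus a bias correction from labeled data.
    theta = lam * f_unlab.mean() + (y_lab - lam * f_lab).mean()

    # Plug-in variance of the estimator and a normal-approximation interval.
    var = (lam**2 * np.var(f_unlab, ddof=1) / N
           + np.var(y_lab - lam * f_lab, ddof=1) / n)
    half = NormalDist().inv_cdf(1 - alpha / 2) * np.sqrt(var)
    return theta, (theta - half, theta + half), lam


# Hypothetical synthetic task: 50 labels, 10,000 proxy-only examples.
rng = np.random.default_rng(0)
n, N = 50, 10_000
y_all = rng.normal(1.0, 1.0, size=n + N)
proxy_all = y_all + 0.3 * rng.normal(size=n + N)  # assumed proxy noise
y_lab, f_lab, f_unlab = y_all[:n], proxy_all[:n], proxy_all[n:]

theta, (lo, hi), lam = power_tuned_ppi_mean(y_lab, f_lab, f_unlab)

# Classical interval that uses the 50 labels alone, for comparison.
classical_half = NormalDist().inv_cdf(0.975) * y_lab.std(ddof=1) / np.sqrt(n)

print(f"PPI estimate {theta:.3f}, interval width {hi - lo:.3f}, lambda {lam:.2f}")
print(f"Labels-only interval width {2 * classical_half:.3f}")
```

With a proxy this closely tied to the truth, the PPI interval is much narrower than the labels-only interval. A weaker proxy drives the tuning weight toward zero and the estimator falls back toward the labels-only mean.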
The authors prove that efficiency gains beyond power-tuned PPI require nonlinear structure, which sets an important boundary on what is possible. They validate the method on synthetic and semi-synthetic datasets, then apply it to a real-world case study: auditing language models on election-related information during the 2024 U.S. presidential election. Using a large human-annotation dataset, they show that cross-task recalibration substantially reduces confidence interval widths compared to independent PPI. This work directly supports rigorous, cost-effective evaluation of AI systems (e.g., detecting biased outputs across demographic subgroups) and enables social scientists to draw valid conclusions from surveys with many related questions but few annotations per question.
- Introduces multi-task PPI framework that exploits shared structure across related tasks to improve inference from scarce labels
- Proves efficiency gains beyond power-tuned PPI require nonlinear cross-task recalibration; affine recalibration offers no asymptotic benefit
- Case study on LLM auditing for 2024 election information demonstrates substantial shrinkage of confidence intervals with limited human annotations
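For intuition on the affine result, consider the simplest case: estimating a mean with the standard power-tuned estimator. The derivation below is an illustration under that assumption, not the paper's proof, and the notation is introduced here: \bar{Y}_n and \bar{f}_n are averages over the n labeled examples, \bar{f}_N is the proxy average over the N unlabeled examples, and g = a f + b is an affine recalibration with a \neq 0.

```latex
\begin{align*}
\hat{\theta}_f &= \bar{Y}_n + \lambda_f \left( \bar{f}_N - \bar{f}_n \right),
  \qquad
  \lambda_f = \frac{\widehat{\mathrm{Cov}}(Y, f)}{(1 + n/N)\,\widehat{\mathrm{Var}}(f)} \\[4pt]
\lambda_g &= \frac{\widehat{\mathrm{Cov}}(Y, a f + b)}{(1 + n/N)\,\widehat{\mathrm{Var}}(a f + b)}
  = \frac{a\,\widehat{\mathrm{Cov}}(Y, f)}{a^{2}\,(1 + n/N)\,\widehat{\mathrm{Var}}(f)}
  = \frac{\lambda_f}{a} \\[4pt]
\hat{\theta}_g &= \bar{Y}_n + \lambda_g \left( \bar{g}_N - \bar{g}_n \right)
  = \bar{Y}_n + \frac{\lambda_f}{a} \cdot a \left( \bar{f}_N - \bar{f}_n \right)
  = \hat{\theta}_f
\end{align*}
```

The shift b cancels in the difference of proxy means and the scale a is absorbed by the tuning weight, so in this setting an affine map of the proxy leaves the power-tuned estimate unchanged. Only a nonlinear map can change the residual variance that sets the interval width.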
Why It Matters
Enables statistically valid AI audits and surveys with minimal human labeling, cutting costs while maintaining rigor.