Toward Human-AI Complementarity Across Diverse Tasks
On 1,886 diverse samples, humans are right where the AI is wrong in just 8.9% of cases.
A study led by Yuzheng Xu and 15 co-authors puts human-AI complementarity to the test on a multi-domain dataset of 1,886 samples spanning knowledge, factuality, long-context reasoning, and deception detection. The researchers evaluate three approaches: simple hybridization (confidence-based routing between human and AI), top-2 assistance (the AI shows the human two options), and subtask delegation. The results are sobering. Baseline hybridization reaches 69.3% accuracy against 68.9% for AI alone, a gain of only 0.4 percentage points. Two factors explain this. First, the complementarity region, where the AI errs but humans are correct, covers just 8.9% of cases. Second, confidence-based routing fails to identify those cases, because model confidence is distributed similarly across correct and incorrect predictions.
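The routing logic described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `Sample` record, the 0.7 threshold, and the ten toy samples are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """Hypothetical per-sample record; field names are illustrative."""
    ai_correct: bool
    human_correct: bool
    ai_confidence: float  # assumed to lie in [0, 1]


def accuracy(flags):
    """Fraction of True values in an iterable of booleans."""
    flags = list(flags)
    return sum(flags) / len(flags)


def complementarity_rate(samples):
    """Share of samples where the AI is wrong but the human is right."""
    return accuracy((not s.ai_correct) and s.human_correct for s in samples)


def hybrid_accuracy(samples, threshold):
    """Use the human answer when AI confidence is below threshold, else the AI answer."""
    return accuracy(
        s.human_correct if s.ai_confidence < threshold else s.ai_correct
        for s in samples
    )


def oracle_accuracy(samples):
    """Ceiling for any router: correct whenever either party is correct."""
    return accuracy(s.ai_correct or s.human_correct for s in samples)


# Toy data (invented): confidence carries no signal about correctness.
samples = [
    Sample(True, True, 0.90), Sample(True, False, 0.60),
    Sample(True, True, 0.80), Sample(True, False, 0.55),
    Sample(True, True, 0.95), Sample(True, False, 0.70),
    Sample(True, False, 0.50), Sample(False, True, 0.85),
    Sample(False, False, 0.90), Sample(False, False, 0.65),
]

ai_only = accuracy(s.ai_correct for s in samples)
print(ai_only, complementarity_rate(samples))
print(hybrid_accuracy(samples, threshold=0.7), oracle_accuracy(samples))
```

In the toy data the single complementary sample has high AI confidence, so the threshold router never hands it to the human. The oracle ceiling equals AI accuracy plus the complementarity rate. Applied to the reported figures, and assuming the 8.9% is measured over all samples, that would put perfect routing at 68.9% + 8.9% = 77.8%. This is an arithmetic consequence of the definition, not a number quoted from the paper.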
The second approach, top-2 assistance applied when AI confidence is low, shows more promise: on those cases, human accuracy jumps from 28.4% to 38.3%, edging past AI alone at 37.7%. However, the paper finds that this improvement comes primarily from humans adopting correct AI suggestions, not from humans overriding AI mistakes. In other words, humans tend to trust the AI's second-best option rather than independently catching errors. The authors conclude that the primary bottleneck is not human task accuracy itself but the design of routing and assistance methods that actually enable humans to catch AI failures. The paper provides quantitative and qualitative breakdowns for each method and domain, offering concrete targets for future work on AI oversight.
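To make the distinction between adopting a suggestion and catching an error concrete, here is a rough sketch of top-2 assistance with a simple outcome tally. The function names, the 0.5 confidence threshold, and the three-way classification are assumptions for illustration; the paper's own accounting may differ.

```python
from collections import Counter


def top2_options(probs):
    """Return the two highest-probability labels, best first."""
    return sorted(probs, key=probs.get, reverse=True)[:2]


def needs_assistance(probs, threshold=0.5):
    """Trigger top-2 assistance when the AI's best guess is below threshold."""
    return max(probs.values()) < threshold


def classify_outcome(shown, gold, human_final):
    """Label one assisted answer (hypothetical scheme, not the paper's)."""
    if human_final != gold:
        return "incorrect"
    # Correct and among the AI's options: the human adopted a suggestion.
    # Correct but not among them: the human got there independently.
    return "adopted_suggestion" if gold in shown else "independent"


# Invented class probabilities for one low-confidence item.
probs = {"A": 0.40, "B": 0.35, "C": 0.25}
shown = top2_options(probs) if needs_assistance(probs) else [max(probs, key=probs.get)]

# Invented (gold label, human final answer) pairs, all shown the same options.
outcomes = Counter(
    classify_outcome(shown, gold, final)
    for gold, final in [("B", "B"), ("C", "C"), ("A", "B")]
)
print(shown, dict(outcomes))
```

Under a tally like this, the paper's finding would correspond to most of the correct assisted answers landing in the `adopted_suggestion` bucket rather than the `independent` one.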
- Baseline hybridization of human and AI judgments yields only +0.4 percentage points over AI alone (69.3% vs 68.9%).
- The complementarity region, where AI errs but humans are correct, is only 8.9% of the 1,886-sample dataset.
- Top-2 assistance lifts human accuracy from 28.4% to 38.3%, but gains come from adopting correct AI suggestions, not catching AI errors.
Why It Matters
For AI oversight, the study shows that gains from human-AI teaming hinge on the design of routing and assistance methods, not just on human accuracy.