Putting HUMANS first: Efficient LAM Evaluation with Human Preference Alignment
Just 0.3% of evaluation data predicts user satisfaction better than full benchmarks.
Evaluating large audio models (LAMs) is expensive and often misaligned with what users actually want. A new paper from Stanford researchers tackles both problems head-on. Led by Woody Haosheng Gan, the team analyzed 10 subset selection methods across 18 audio models and 40 tasks. They discovered that curated subsets of just 50 examples (0.3% of the full data) achieve a 0.93 Pearson correlation with complete benchmark scores. More striking still, when they tested how well those scores matched real user satisfaction, using 776 human preference ratings from realistic voice assistant conversations, both the subsets and the full benchmark reached a correlation of only 0.85.
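As a rough illustration of the subset-versus-full-benchmark comparison, the sketch below computes the Pearson correlation between each model's mean score on a 50-item subset and its mean score on all items. Everything here is assumed for illustration: the score matrix is synthetic, the item count is arbitrary, `subset_correlation` is a hypothetical helper name, and the subset is drawn at random as a stand-in rather than chosen by any of the selection methods the paper studied.

```python
import numpy as np


def subset_correlation(scores: np.ndarray, subset_idx: np.ndarray) -> float:
    """Pearson r between per-model mean scores on a subset and on all items.

    scores has shape (n_models, n_items); each entry is one model's score
    on one benchmark item.
    """
    full_means = scores.mean(axis=1)
    subset_means = scores[:, subset_idx].mean(axis=1)
    return float(np.corrcoef(full_means, subset_means)[0, 1])


# Synthetic data (an assumption): 18 models, 2000 items, binary correctness
# generated from a simple model-ability versus item-difficulty setup.
rng = np.random.default_rng(0)
n_models, n_items, k = 18, 2000, 50
ability = rng.normal(size=(n_models, 1))
difficulty = rng.normal(size=(1, n_items))
prob_correct = 1.0 / (1.0 + np.exp(-(ability - difficulty)))
scores = (rng.random((n_models, n_items)) < prob_correct).astype(float)

# Random subset as a placeholder for a curated selection method.
subset_idx = rng.choice(n_items, size=k, replace=False)
r = subset_correlation(scores, subset_idx)
print(f"Pearson r, {k}-item subset vs. full benchmark: {r:.3f}")
```

Swapping the random `subset_idx` for indices produced by a selection method is the only change needed to compare methods on the same score matrix.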
To close that gap, the researchers trained regression models on the selected subsets. These models achieved a 0.98 correlation with human preferences, significantly outperforming regression models trained on random subsets or even the entire benchmark dataset. The key insight: well-curated data beats massive data when predicting what people actually like. The team open-sourced their regression-weighted subsets as the HUMANS benchmark, offering an efficient proxy that captures both raw performance and user satisfaction. This could dramatically reduce the cost and complexity of comparing audio AI models while ensuring evaluations reflect real-world usage.
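The regression step can be sketched in a similar spirit. The summary does not say which regression model the authors used, so the example below assumes closed-form ridge regression, with per-item subset scores as features and one preference score per model as the target. The data is synthetic, and `ridge_fit` and `loo_predictions` are hypothetical helper names. Predictions are made leave-one-out because 18 models with 50 features could otherwise be fit almost perfectly in-sample.

```python
import numpy as np


def ridge_fit(X: np.ndarray, y: np.ndarray, alpha: float = 1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Solve (Xc^T Xc + alpha * I) w = Xc^T yc for the item weights.
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b


def loo_predictions(X: np.ndarray, y: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Predict each model's preference score from a fit on all other models."""
    preds = np.empty(len(y), dtype=float)
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        w, b = ridge_fit(X[keep], y[keep], alpha)
        preds[i] = X[i] @ w + b
    return preds


# Synthetic stand-ins (assumptions): 18 models scored on a 50-item subset,
# and a made-up preference score per model in place of real human ratings.
rng = np.random.default_rng(0)
n_models, k = 18, 50
X = rng.random((n_models, k))
true_w = rng.normal(size=k)
y = X @ true_w + rng.normal(scale=0.1, size=n_models)

preds = loo_predictions(X, y, alpha=1.0)
r_pref = float(np.corrcoef(y, preds)[0, 1])
print(f"Leave-one-out Pearson r with preference scores: {r_pref:.3f}")
```

The learned weights `w` play the role of per-item regression weights: items whose scores track preference more closely receive larger weights. The penalty `alpha` is a free parameter here and would need tuning on real data.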
- Subsets of 50 examples (0.3% of full data) achieve 0.93 Pearson correlation with full benchmark scores across 18 models and 40 tasks.
- Regression models trained on curated subsets hit 0.98 correlation with 776 human preference ratings, beating both random subsets and the full benchmark.
- The open-source HUMANS benchmark provides an efficient, preference-aligned proxy for LAM evaluation, reducing cost and data redundancy.
Why It Matters
HUMANS enables cheap, user-centric evaluation of audio AI models, which could speed up development while keeping evaluations tied to real-world user satisfaction.