Research & Papers

Unpacking Human Preference for LLMs: Demographically Aware Evaluation with the HUMAINE Framework

A massive study of 23,404 people shows that AI preferences vary dramatically by age, with Gemini 2.5 Pro winning overall.

Deep Dive

A team of researchers led by Nora Petrova and Andrew Gordon has published a landmark paper at ICLR 2026 introducing HUMAINE (Human-AI Interaction Measurement), a framework that tackles critical shortcomings of current LLM evaluation, which often relies on narrow technical benchmarks or unrepresentative human feedback. HUMAINE shifts the paradigm to a multidimensional, demographically stratified evaluation, collecting naturalistic conversations from a census-aligned sample of 23,404 participants in the US and UK to assess 28 state-of-the-art models.
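
As a concrete illustration of what "census-aligned" can mean in practice, here is a minimal sketch of post-stratification weighting, one common way to align a sample to census demographics. The age brackets and shares below are hypothetical, and the paper's actual alignment procedure may differ.

```python
# A minimal sketch of post-stratification weighting, one common way to make
# a sample "census-aligned". The age brackets and shares below are
# hypothetical, not HUMAINE's actual strata or numbers.
census = {"18-29": 0.21, "30-49": 0.34, "50-64": 0.25, "65+": 0.20}  # population shares
sample = {"18-29": 0.35, "30-49": 0.35, "50-64": 0.20, "65+": 0.10}  # skewed panel shares

# Each respondent in a stratum gets weight = census share / sample share,
# so weighted averages reflect the population rather than the panel.
weights = {group: census[group] / sample[group] for group in census}
for group, w in weights.items():
    print(f"{group}: weight = {w:.2f}")
```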

Using a hierarchical Bayesian model, the study yielded three major insights. First, it established a clear performance hierarchy, with Google's Gemini 2.5 Pro emerging as the top-ranked model with a 95.6% posterior probability of being the best. Second, it uncovered significant preference heterogeneity: a user's age is the primary demographic factor driving disagreement, and a model's rank can shift substantially across age groups, exposing generalization failures that typical evaluations mask. Third, it quantified vast differences in how easily humans can judge AI on different dimensions: ambiguous qualities like 'Trust, Ethics & Safety' produced a 65% tie rate, compared with a decisive 10% tie rate for picking an 'Overall Winner'. The researchers have released their complete dataset, an interactive leaderboard, and the open-source framework to push the field toward more responsible and representative assessment.
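
The paper's exact model isn't reproduced here, but the core idea of turning pairwise human votes into a "probability of being best" can be sketched with a simple Bayesian Bradley-Terry model. Everything below is illustrative: the model names, vote counts, the flat (non-hierarchical) prior, and the basic Metropolis sampler are assumptions, not HUMAINE's implementation.

```python
# A minimal sketch (not the authors' code) of how a Bayesian Bradley-Terry
# model can turn pairwise "which response was better?" votes into a posterior
# probability that each model is the best. All data here is synthetic.
import numpy as np

rng = np.random.default_rng(0)

models = ["model_a", "model_b", "model_c"]          # hypothetical entrants
n = len(models)

# wins[i, j] = number of head-to-head comparisons model i won against model j
wins = np.array([[0, 60, 70],
                 [40, 0, 55],
                 [30, 45, 0]])

def log_posterior(theta):
    """Log posterior of latent strengths under a Bradley-Terry likelihood
    with a standard-normal prior (a stand-in for the paper's hierarchy)."""
    lp = -0.5 * np.sum(theta**2)                    # N(0, 1) prior
    for i in range(n):
        for j in range(n):
            if i != j and wins[i, j] > 0:
                # log P(i beats j) = -log(1 + exp(theta_j - theta_i))
                lp -= wins[i, j] * np.log1p(np.exp(theta[j] - theta[i]))
    return lp

# Random-walk Metropolis sampler over the strength vector.
theta = np.zeros(n)
lp = log_posterior(theta)
samples = []
for step in range(20_000):
    prop = theta + rng.normal(scale=0.1, size=n)
    lp_prop = log_posterior(prop)
    if np.log(rng.uniform()) < lp_prop - lp:
        theta, lp = prop, lp_prop
    if step >= 5_000 and step % 10 == 0:            # burn-in, then thin
        samples.append(theta.copy())

samples = np.array(samples)
# P(best) = share of posterior draws in which each model has the top strength
p_best = np.bincount(samples.argmax(axis=1), minlength=n) / len(samples)
for name, p in zip(models, p_best):
    print(f"{name}: P(best) = {p:.3f}")
```

The counting step at the end, i.e. the fraction of posterior draws that put a model on top, is the usual way a figure like the reported 95.6% is read; a hierarchical version would additionally share strength estimates across demographic strata and judgment dimensions.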

Key Points
  • Google's Gemini 2.5 Pro ranked first overall with a 95.6% posterior probability of being the top model among 28 tested.
  • User age was the primary demographic axis of preference disagreement, with model ranks shifting significantly across age groups.
  • The 'Trust, Ethics & Safety' dimension had a 65% tie rate versus a 10% tie rate for 'Overall Winner', highlighting the challenge of measuring ambiguous qualities (see the tie-aware sketch after this list).
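
To see how tie rates interact with ranking, here is a minimal sketch of a tie-aware Bradley-Terry variant, the Rao-Kupper model, chosen purely for illustration; the paper may use a different likelihood. The strengths and the tie parameter nu are hypothetical. The point is that for the same underlying skill gap, a "softer" dimension yields mostly ties, so each comparison carries far less ranking signal.

```python
# A minimal sketch (synthetic, not HUMAINE's exact likelihood) of the
# Rao-Kupper tie-aware Bradley-Terry model. A larger tie parameter nu mimics
# dimensions like 'Trust, Ethics & Safety', where most comparisons end in
# ties and each vote is less decisive.
import numpy as np

def rao_kupper_probs(theta_i, theta_j, nu):
    """Win / loss / tie probabilities for items with strengths theta_i, theta_j
    (nu >= 1; nu = 1 recovers the tie-free Bradley-Terry model)."""
    pi, pj = np.exp(theta_i), np.exp(theta_j)
    win = pi / (pi + nu * pj)
    loss = pj / (pj + nu * pi)
    tie = 1.0 - win - loss
    return win, loss, tie

# Same skill gap, two hypothetical dimensions with different tie tendencies.
for label, nu in [("decisive dimension", 1.1),
                  ("ambiguous dimension", 3.0)]:
    win, loss, tie = rao_kupper_probs(0.5, 0.0, nu)
    print(f"{label}: win={win:.2f} loss={loss:.2f} tie={tie:.2f}")
```

With these illustrative numbers, the ambiguous dimension turns roughly half of all comparisons into ties while the decisive one almost never does, which is qualitatively the 65% versus 10% pattern the study reports.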

Why It Matters

This research demonstrates that AI model performance is not universal: evaluations must account for diverse user demographics, or their rankings risk being biased and misleading.