PopResume: Causal Fairness Evaluation of LLM/VLM Resume Screeners with Population-Representative Dataset
New research exposes five discrimination patterns in LLM hiring tools that standard metrics completely miss.
A team of researchers led by Sumin Yu has introduced PopResume, a groundbreaking dataset designed to expose hidden biases in AI-powered resume screening systems. Unlike previous benchmarks that artificially inject demographic data, PopResume contains 60.8K resumes across five occupations, constructed using real population statistics to preserve natural relationships between attributes. The researchers employed causal fairness evaluation using path-specific effects (PSE), which separates the influence of protected attributes into two distinct paths: the 'business necessity path' mediated by job-relevant qualifications, and the 'redlining path' mediated by demographic proxies. This distinction is crucial for determining whether disparities are legally permissible or constitute illegal discrimination.
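The path-specific decomposition can be sketched with a toy structural model. This is a minimal, hypothetical illustration, not the paper's actual estimator: the linear equations, coefficients, and variable names (`qual`, `proxy`, `score`) are all invented for exposition. The key move is counterfactual substitution, letting the protected attribute vary along only one mediator path at a time:

```python
import random

random.seed(0)

# Hypothetical linear structural model (illustrative only):
#   A: protected attribute (0/1)
#   Q: job-relevant qualification, influenced by A (business-necessity path)
#   P: demographic proxy in the resume, influenced by A (redlining path)
#   Y: screener score, a function of Q and P
def qual(a, u):  return 0.3 * a + u        # qualification mediator
def proxy(a, v): return 0.8 * a + v        # proxy mediator
def score(q, p): return 1.0 * q + 0.5 * p  # screener output

def path_specific_effects(n=10_000):
    """Estimate effects by switching A along one mediator path at a time."""
    tot = bus = red = 0.0
    for _ in range(n):
        u, v = random.gauss(0, 1), random.gauss(0, 1)
        y00 = score(qual(0, u), proxy(0, v))    # baseline: A=0 everywhere
        y11 = score(qual(1, u), proxy(1, v))    # A=1 on both paths
        y_bus = score(qual(1, u), proxy(0, v))  # A=1 only via qualifications
        y_red = score(qual(0, u), proxy(1, v))  # A=1 only via proxies
        tot += y11 - y00
        bus += y_bus - y00
        red += y_red - y00
    return tot / n, bus / n, red / n

total, business, redlining = path_specific_effects()
print(f"total effect       {total:.2f}")      # ≈ 0.70
print(f"business-path PSE  {business:.2f}")   # ≈ 0.30 (potentially permissible)
print(f"redlining-path PSE {redlining:.2f}")  # ≈ 0.40 (potential proxy discrimination)
```

In this linear toy model the noise terms cancel exactly, so the decomposition is deterministic; the point is only that the total disparity (0.70) splits cleanly into a qualification-mediated component and a proxy-mediated one, which is the distinction PSE-based auditing relies on.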
When testing four large language models (LLMs) and four vision-language models (VLMs) on this dataset, the study uncovered five distinct discrimination patterns that aggregate fairness metrics completely failed to detect. The research demonstrates that standard outcome-level measurements—the kind typically used in AI auditing—can mask significant bias, creating a false sense of fairness. This work represents a major advancement in AI auditing methodology, providing hiring platforms and regulators with tools to distinguish between legitimate qualification-based scoring and illegal discrimination based on demographic proxies embedded in resume formatting, language, or experience patterns.
- PopResume dataset contains 60.8K population-representative resumes across five occupations, using real statistics instead of artificial demographic injection
- Causal fairness evaluation using path-specific effects (PSE) separates legal qualification-based scoring from illegal demographic proxy bias
- Testing of eight AI models (four LLMs, four VLMs) revealed five discrimination patterns that standard aggregate metrics completely missed
Why It Matters
Provides regulators and employers with tools to audit AI hiring systems for hidden bias that current compliance checks miss.