PopResume: Causal Fairness Evaluation of LLM/VLM Resume Screeners with Population-Representative Dataset
New research exposes five discrimination patterns in LLM hiring tools that standard metrics completely miss.
A team of researchers led by Sumin Yu has introduced PopResume, a groundbreaking dataset designed to expose hidden biases in AI-powered resume screening systems. Unlike previous benchmarks that artificially inject demographic data, PopResume contains 60.8K resumes across five occupations, constructed using real population statistics to preserve natural relationships between attributes. The researchers employed causal fairness evaluation using path-specific effects (PSE), which separates the influence of protected attributes into two distinct paths: the 'business necessity path' mediated by job-relevant qualifications, and the 'redlining path' mediated by demographic proxies. This distinction is crucial for determining whether disparities are legally permissible or constitute illegal discrimination.
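The path-specific decomposition can be sketched with a toy structural model. This is a minimal, hypothetical illustration, not the paper's actual estimator: the linear equations, coefficients, and variable names (`qual`, `proxy`, `score`) are all invented for exposition. The key move is counterfactual substitution, letting the protected attribute vary along only one mediator path at a time:

```python
import random

random.seed(0)

# Hypothetical linear structural model (illustrative only):
#   A: protected attribute (0/1)
#   Q: job-relevant qualification, influenced by A (business-necessity path)
#   P: demographic proxy in the resume, influenced by A (redlining path)
#   Y: screener score, a function of Q and P
def qual(a, u):  return 0.3 * a + u        # qualification mediator
def proxy(a, v): return 0.8 * a + v        # proxy mediator
def score(q, p): return 1.0 * q + 0.5 * p  # screener output

def path_specific_effects(n=10_000):
    """Estimate effects by switching A along one mediator path at a time."""
    tot = bus = red = 0.0
    for _ in range(n):
        u, v = random.gauss(0, 1), random.gauss(0, 1)
        y00 = score(qual(0, u), proxy(0, v))    # baseline: A=0 everywhere
        y11 = score(qual(1, u), proxy(1, v))    # A=1 on both paths
        y_bus = score(qual(1, u), proxy(0, v))  # A=1 only via qualifications
        y_red = score(qual(0, u), proxy(1, v))  # A=1 only via proxies
        tot += y11 - y00
        bus += y_bus - y00
        red += y_red - y00
    return tot / n, bus / n, red / n

total, business, redlining = path_specific_effects()
print(f"total effect       {total:.2f}")      # ≈ 0.70
print(f"business-path PSE  {business:.2f}")   # ≈ 0.30 (potentially permissible)
print(f"redlining-path PSE {redlining:.2f}")  # ≈ 0.40 (potential proxy discrimination)
```

In this linear toy model the noise terms cancel exactly, so the decomposition is deterministic; the point is only that the total disparity (0.70) splits cleanly into a qualification-mediated component and a proxy-mediated one, which is the distinction PSE-based auditing relies on.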
When testing four large language models (LLMs) and four vision-language models (VLMs) on this dataset, the study uncovered five distinct discrimination patterns that aggregate fairness metrics completely failed to detect. The research demonstrates that standard outcome-level measurements—the kind typically used in AI auditing—can mask significant bias, creating a false sense of fairness. This work represents a major advancement in AI auditing methodology, providing hiring platforms and regulators with tools to distinguish between legitimate qualification-based scoring and illegal discrimination based on demographic proxies embedded in resume formatting, language, or experience patterns.
- PopResume dataset contains 60.8K population-representative resumes across five occupations, using real statistics instead of artificial demographic injection
- Causal fairness evaluation using path-specific effects (PSE) separates legal qualification-based scoring from illegal demographic proxy bias
- Testing of eight AI models (four LLMs, four VLMs) revealed five discrimination patterns that standard aggregate metrics completely missed
Why It Matters
Provides regulators and employers with tools to audit AI hiring systems for hidden bias that current compliance checks miss.