Investigating Counterfactual Unfairness in LLMs towards Identities through Humor
New research shows AI models judge jokes differently based on who's telling them, exposing hidden bias.
A team of researchers from KAIST and Seoul National University has developed a novel framework to measure hidden bias in large language models (LLMs) by analyzing their reactions to humor. Published at ACL 2026, the study 'Investigating Counterfactual Unfairness in LLMs towards Identities through Humor' uses a counterfactual method: it swaps the identity of the joke's speaker with that of its target while keeping the joke's content word-for-word identical. Because only the identity framing changes, any shift in the model's response can be attributed to social perceptions embedded in its training data rather than to the joke itself.
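To make the counterfactual setup concrete, here is a minimal sketch of how such speaker/target swaps could be constructed. The identity labels, prompt template, and function names below are illustrative assumptions for exposition, not the paper's actual implementation.

```python
from itertools import permutations

# Hypothetical identity labels; the paper's actual identity categories may differ.
IDENTITIES = ["a man", "a woman", "a wheelchair user", "an immigrant"]

# Hypothetical prompt framing; only the speaker/target slots vary between conditions.
JOKE_TEMPLATE = (
    "The following joke is told by {speaker} and is about {target}:\n"
    "\"{joke}\"\n"
)

def counterfactual_pairs(joke: str):
    """Yield prompt pairs that differ only in the speaker/target assignment.

    The joke text itself is held constant, so any difference in the model's
    response can be attributed to the identity framing alone.
    """
    for speaker, target in permutations(IDENTITIES, 2):
        original = JOKE_TEMPLATE.format(speaker=speaker, target=target, joke=joke)
        swapped = JOKE_TEMPLATE.format(speaker=target, target=speaker, joke=joke)
        yield original, swapped
```

Each pair would then be sent to the model under test, and its responses compared across the two framings.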
The framework tests models across three tasks: humor generation refusal, speaker intention inference, and predicting a joke's social impact. The results across state-of-the-art models revealed consistent and significant relational disparities. For example, when a joke was framed as being told by a 'privileged' speaker (e.g., a man) about a 'marginalized' target, models were up to 67.5% more likely to refuse to generate it and 64.7% more likely to judge the speaker's intent as malicious, compared to the identity-swapped scenario.
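How a figure like "67.5% more likely to refuse" might be computed is not spelled out in this summary; the sketch below assumes a simple relative-rate comparison between the two counterfactual conditions. The sample counts and the exact metric definition are assumptions, not taken from the paper.

```python
def relative_disparity(refused_original: list, refused_swapped: list) -> float:
    """Relative increase in refusal rate for one framing over its counterfactual.

    A value of 0.675 would correspond to 'up to 67.5% more likely to refuse';
    the paper's actual metric definition may differ.
    """
    rate_original = sum(refused_original) / len(refused_original)
    rate_swapped = sum(refused_swapped) / len(refused_swapped)
    return (rate_original - rate_swapped) / rate_swapped

# Example with made-up counts: 67 refusals out of 100 in one framing vs 40 out of 100
# in the swapped framing gives (0.67 - 0.40) / 0.40 = 0.675, i.e. 67.5% more likely.
print(relative_disparity([True] * 67 + [False] * 33, [True] * 40 + [False] * 60))
```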
These findings highlight a complex coexistence of over-sensitivity and stereotyping within generative AI. The models demonstrate a form of 'asymmetric fairness,' where they apply stricter social rules to certain identity groups, which ironically reinforces stereotypes. This research provides concrete, interpretable metrics for a subtle form of bias that standard fairness benchmarks often miss, showing that achieving true cultural alignment in AI is more nuanced than simply filtering harmful content.
- Jokes told by 'privileged' speakers were refused by models up to 67.5% more often than identical jokes from other speakers.
- Models judged privileged speakers' intent as malicious 64.7% more frequently, revealing asymmetric application of social rules.
- The study introduces a counterfactual testing framework using humor as a lens to expose relational biases not caught by standard benchmarks.
Why It Matters
This reveals subtle, hard-to-detect bias in AI that affects content moderation, creative tools, and any application where social context matters.