ReBias-Lens framework reveals LLM self-reflection reduces overall bias but amplifies it in specific categories
Self-reflection smooths bias broadly but stubbornly locks in and worsens localized prejudices.
A new paper introduces ReBias-Lens, a probing framework to understand the internal mechanics of self-reflection in Large Language Models (LLMs) and its effect on social biases. The framework uses a metric called Valence Fluctuation (VF), with two variants: Global-VF tracks macroscopic encoding trends across layers, while Local-VF captures distinctiveness within specific social categories. Testing four LLMs across twelve social categories, researchers found that as layers deepen, overall valence fluctuations undergo a distinct smoothing, leading to a widespread mitigation of bias at the behavioral level.
However, this macro-level reduction masks a more troubling pattern: the reflection mechanism exhibits stubborn, category-specific selectivity. Rather than correcting biases uniformly, it locks in and even amplifies localized biases for certain groups. This contradicts the assumption that self-reflection inherently reduces bias. The findings show that LLM self-reflection is not a silver bullet: it can entrench certain stereotypes even as it broadly cleanses others. That argues for caution in deploying autonomous bias mitigation without layer-wise and category-specific monitoring.
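To make the global-versus-local distinction concrete, here is a toy sketch in Python. It is not the paper's method: this summary does not give the actual Global-VF and Local-VF formulas, so the two definitions below (mean absolute valence per layer, and a category's distance from the layer mean) are stand-in assumptions, and the category names and scores are made up for illustration.

```python
from statistics import mean


def global_vf(valence):
    """ASSUMED definition: mean absolute valence across categories, per layer."""
    return [mean(abs(v) for v in layer.values()) for layer in valence]


def local_vf(valence):
    """ASSUMED definition: each category's absolute distance from the layer mean."""
    result = []
    for layer in valence:
        center = mean(layer.values())
        result.append({cat: abs(v - center) for cat, v in layer.items()})
    return result


def amplified_categories(valence):
    """Categories whose local score is higher at the last layer than at the first."""
    local = local_vf(valence)
    first, last = local[0], local[-1]
    return sorted(cat for cat in first if last[cat] > first[cat])


# Made-up valence scores: one dict per layer, keyed by hypothetical category.
toy_valence = [
    {"category_a": 0.6, "category_b": -0.5, "category_c": 0.4},
    {"category_a": 0.3, "category_b": -0.2, "category_c": 0.5},
    {"category_a": 0.1, "category_b": -0.1, "category_c": 0.7},
]

if __name__ == "__main__":
    print("global score per layer:", global_vf(toy_valence))
    print("categories amplified by the last layer:", amplified_categories(toy_valence))
```

On this toy data the aggregate score falls from layer to layer while one category's local score rises, which is the kind of pattern an aggregate-only check would miss and a per-layer, per-category check would flag.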
- ReBias-Lens uses Valence Fluctuation metrics (Global-VF and Local-VF) to probe bias reconfiguration across transformer layers.
- Across 4 LLMs and 12 social categories, overall bias decreases at the behavioral level as valence fluctuations smooth out in deeper layers.
- Self-reflection paradoxically amplifies localized biases in specific categories, showing 'stubborn category-specific selectivity.'
Why It Matters
Self-reflection in LLMs isn't a universal fix: it can entrench specific biases, so it calls for careful monitoring and targeted mitigation.