New LLM bias study reveals Gemini 1.5 hits 72.7% moral sensitivity score
Seven-tier stress test uncovers a U-curve of bias across models and scales.
A new study from researchers including Yash Aggarwal and Aman Chadha introduces the Moral Sensitivity Index (MSI), a metric that quantifies bias in LLMs across a graduated seven-tier stress test ranging from abstract numerical problems to scenarios rooted in historical and socioeconomic injustice. Evaluating Claude 3.5, Qwen 3.5, Llama 3, and Gemini 1.5, the team found distinct behavioral signatures: Gemini 1.5 reached 72.7% MSI by Tier 5 under socioeconomic framing, while Claude 3.5 exhibited sharp bias suppression consistent with identity-based safety training.
For mechanistic validation, the researchers selected criminal-bias scenarios and applied logit lens, attention analysis, activation patching, and semantic probing to six models across three capability tiers. Circuit-level analysis revealed a U-curve of bias: small language models (SLMs) exhibit strong criminal bias; scaling to instruction-tuned models eliminates it; but reasoning distillation reintroduces bias to SLM-like levels despite identical parameter counts. This suggests distillation compresses reasoning traces in ways that reactivate shallow statistical associations, and it provides cross-stage validation that socially loaded cues activate the same bias circuits identified mechanistically.
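The logit lens mentioned above reads off a model's "beliefs" at intermediate depth by projecting each layer's residual-stream state through the final unembedding matrix. The sketch below is a toy illustration with random weights, not the study's code; all names, dimensions, and token ids are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM, LAYERS = 50, 16, 4  # toy sizes, purely illustrative

W_embed = rng.normal(size=(VOCAB, DIM))                             # token embeddings
W_blocks = [rng.normal(scale=0.1, size=(DIM, DIM)) for _ in range(LAYERS)]
W_unembed = rng.normal(size=(DIM, VOCAB))                           # final LM head

def logit_lens(token_ids):
    """Return the top predicted token id after each layer, obtained by
    projecting the intermediate residual stream through the unembedding."""
    h = W_embed[token_ids]                  # (seq, DIM) residual stream
    tops = []
    for W in W_blocks:
        h = h + np.tanh(h @ W)              # residual block update
        logits = h[-1] @ W_unembed          # logit lens at this depth
        tops.append(int(logits.argmax()))
    return tops

preds = logit_lens(np.array([3, 7, 12]))
print(preds)  # one top-token id per layer
```

In a real transformer the same idea lets researchers see at which depth a biased completion first becomes the model's top candidate.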
- Moral Sensitivity Index (MSI) measures graduated bias across seven tiers from abstract math to historical injustice
- Gemini 1.5 reached 72.7% MSI by Tier 5 under socioeconomic framing; Claude suppressed bias via identity-based safety
- Mechanistic analysis shows a U-curve: SLMs have strong criminal bias, instruction tuning removes it, but reasoning distillation reintroduces it to SLM levels
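Activation patching, one of the causal tools behind the U-curve claim, splices activations from a "clean" run into a "corrupted" run and measures how much of the clean output is recovered; layers with high recovery are causally implicated in the behavior. A minimal toy sketch with random weights and invented names, not the study's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
DIM, LAYERS = 16, 4  # toy sizes, purely illustrative

Ws = [rng.normal(scale=0.3, size=(DIM, DIM)) for _ in range(LAYERS)]
w_out = rng.normal(size=DIM)        # scalar readout, stand-in for a bias logit
x_clean = rng.normal(size=DIM)      # stand-in for a neutral prompt state
x_corrupt = rng.normal(size=DIM)    # stand-in for a socially loaded prompt state

def forward(h0, patch_layer=None, patch_delta=None):
    """Run the toy residual network, optionally splicing in a cached
    block output (delta) from another run at one layer."""
    h = h0.copy()
    deltas = []
    for i, W in enumerate(Ws):
        d = np.tanh(h @ W)          # this block's contribution
        if i == patch_layer:
            d = patch_delta         # patch: reuse the clean run's contribution
        deltas.append(d)
        h = h + d                   # residual update
    return float(h @ w_out), deltas

clean_out, clean_deltas = forward(x_clean)
corrupt_out, _ = forward(x_corrupt)

# Patch each layer's clean contribution into the corrupt run; the fraction
# of the clean-vs-corrupt gap recovered localizes causally important layers.
for i in range(LAYERS):
    patched_out, _ = forward(x_corrupt, patch_layer=i, patch_delta=clean_deltas[i])
    recovery = (patched_out - corrupt_out) / (clean_out - corrupt_out)
    print(f"layer {i}: recovered {recovery:+.2f} of the gap")
```

In the study's setting, the "clean" and "corrupt" runs would differ only in a socially loaded cue, so high-recovery components are the bias-driving circuits.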
Why It Matters
The study shows how bias emerges and re-emerges during model distillation, a critical consideration for safe LLM deployment in high-stakes domains such as criminal justice.