Research & Papers

LLMs hide asymmetric racial bias even when outputs appear fair

Activation steering reveals suppressed demographic bias in mortgage AI models.

Deep Dive

A new paper on arXiv (2605.15217) by Jagdish Tripathy and Marcus Buckmann points to a critical flaw in how AI fairness is audited for high-stakes decisions like mortgage underwriting. The researchers tested open-weight, instruction-tuned LLMs on matched applications that differed only in racially associated names. The models showed no output-level bias, yet their internal representations retained, and even amplified, strong demographic associations across layers.
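
As a rough illustration of the output-level side of such an audit, the sketch below scores matched applications that differ only in the applicant name and reports the approval-rate gap. Everything here is an assumption for illustration: `matched_pair_gap`, `toy_decide`, the field names, and the placeholder names are hypothetical, and `toy_decide` merely stands in for a call to an LLM. It is not the paper's pipeline.

```python
from typing import Callable, Dict, List

Application = Dict[str, object]


def matched_pair_gap(
    decide: Callable[[Application], bool],
    applications: List[Application],
    name_a: str,
    name_b: str,
) -> float:
    """Approval-rate difference (name_a minus name_b) on otherwise identical files."""
    approvals_a = 0
    approvals_b = 0
    for app in applications:
        # Each application is scored twice; only the name field changes.
        approvals_a += decide({**app, "name": name_a})
        approvals_b += decide({**app, "name": name_b})
    return (approvals_a - approvals_b) / len(applications)


def toy_decide(app: Application) -> bool:
    """Hypothetical stand-in for a model call; it ignores the name entirely."""
    return app["credit_score"] >= 680 and app["dti"] <= 0.43


applications = [
    {"credit_score": 720, "dti": 0.30},
    {"credit_score": 650, "dti": 0.35},
    {"credit_score": 700, "dti": 0.50},
    {"credit_score": 690, "dti": 0.40},
]

gap = matched_pair_gap(toy_decide, applications, "Applicant A", "Applicant B")
print(f"approval-rate gap: {gap:+.2f}")
```

A gap of zero from a check like this says only that the decisions match. It says nothing about what the model represents internally, which is the blind spot the paper highlights.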

Through activation steering and novel cross-layer interventions, they showed this suppressed information is decision-relevant: reinjecting it at critical layers caused near-complete decision reversals. The latent bias is also asymmetric: steering strongly affected decisions in one demographic direction but had minimal effect in the reverse. It can also be exploited through adversarial prompt engineering and parameter-efficient fine-tuning. The authors argue that behavioral audits focused on outputs alone are insufficient, and they propose dual-layer testing frameworks that combine output evaluation with representational analysis for AI governance.
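
The basic idea of activation steering can be sketched on a toy model. The example below assumes a difference-of-means steering vector (a common choice, not necessarily the one the authors use) and a two-dimensional "hidden state" feeding a linear readout. All names, weights, and numbers are made up for illustration. A real experiment would add the vector to a transformer layer's activations during the forward pass.

```python
from typing import List

Vector = List[float]


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(acts_a: List[Vector], acts_b: List[Vector]) -> Vector:
    """Difference of mean activations between two groups (A minus B)."""
    return [a - b for a, b in zip(mean(acts_a), mean(acts_b))]


def steer(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Add alpha * direction to a hidden state; negative alpha steers toward B."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def approve(hidden: Vector, weights: Vector, bias: float) -> bool:
    """Toy linear readout standing in for the rest of the network."""
    return sum(h * w for h, w in zip(hidden, weights)) + bias > 0


# Made-up activations recorded for inputs associated with group A and group B.
acts_a = [[1.0, 0.5], [1.0, 1.5]]
acts_b = [[-1.0, 0.5], [-1.0, 1.5]]
direction = steering_vector(acts_a, acts_b)

weights, bias = [0.5, 1.0], -1.0
hidden = [0.0, 0.8]  # an application whose state carries no group signal

baseline = approve(hidden, weights, bias)
toward_a = approve(steer(hidden, direction, 1.0), weights, bias)
toward_b = approve(steer(hidden, direction, -1.0), weights, bias)
print(baseline, toward_a, toward_b)
```

In this toy the readout still uses the first coordinate, so reinjecting the group direction flips a denial into an approval. Steering the other way changes nothing here only because the example already starts as a denial, which is not a reproduction of the asymmetry the paper reports.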

Key Points
  • Open-weight LLMs tested on mortgage underwriting produce fair outputs but retain biased internal representations across layers.
  • Activation steering at critical layers reverses decisions asymmetrically, affecting one demographic direction far more than the reverse.
  • Latent bias is exploitable via adversarial prompts and fine-tuning, challenging the sufficiency of output-only audits.

Why It Matters

Output fairness checks alone aren't enough: audits of internal representations are also essential for safe AI in lending, hiring, and healthcare.