Research & Papers

Compared to What? Baselines and Metrics for Counterfactual Prompting

Gender bias conclusions in LLMs may be statistical noise from surface-form changes

Deep Dive

A new paper from Zihao Yang, Mosh Levy, Yoav Goldberg, and Byron C. Wallace challenges a core assumption in LLM bias evaluation: that counterfactual prompting (changing one attribute such as gender) isolates the effect of that attribute. They argue that every prompt edit bundles the variable of interest with incidental surface-form variation, violating 'treatment variation irrelevance.' In experiments on MedQA, surgically altering patient gender flipped model predictions 14.9% of the time – but simply paraphrasing the original input (keeping gender and content identical) caused flips 14.1% of the time. That difference is not statistically significant, so the observed flips cannot be attributed to the gender change itself. To address this, the authors propose comparing the target-intervention flip rate against a distribution of paraphrasing-induced flip rates using statistical testing (e.g., Fisher's exact test).
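The proposed comparison can be sketched as a two-sided Fisher's exact test on flip counts. The implementation below uses only the standard library, and the raw counts are hypothetical: the paper reports rates, not sample sizes, so a sample of 1,000 items is assumed here to match the 14.9% and 14.1% figures.

```python
from fractions import Fraction
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test on the 2x2 table [[a, b], [c, d]].

    Conditions on the fixed margins and sums the hypergeometric
    probability of every table no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)          # total arrangements given margins
    lo, hi = max(0, col1 - row2), min(col1, row1)
    # Unnormalized weight of each possible top-left cell value k.
    weights = [comb(row1, k) * comb(row2, col1 - k) for k in range(lo, hi + 1)]
    observed = weights[a - lo]
    p = sum(w for w in weights if w <= observed)
    return float(Fraction(p, denom))         # exact rational -> float

# Hypothetical counts: 1,000 MedQA items, 149 flips under the gender swap
# (14.9%) versus 141 flips under paraphrase alone (14.1%).
p_value = fisher_exact_two_sided(149, 851, 141, 859)
print(f"two-sided p = {p_value:.3f}")
```

With counts this close, the p-value lands well above 0.05, mirroring the paper's conclusion that the gender effect is indistinguishable from paraphrasing noise.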

Applying this framework to the MedPerturb dataset – previously cited as evidence of LLM sensitivity to demographics and style – they found that only 5 of 120 tests remained statistically significant after controlling for paraphrasing baselines. However, when they tested occupational biography classification, the framework clearly detected directional gender bias (e.g., male-linked pronouns raising the probability of 'doctor'), demonstrating that it can detect real but small effects. They also compared metrics: aggregate flip rates are low-power, per-sample distributional metrics are dramatically stronger, and regression uniquely characterizes both the direction and magnitude of an effect. This work provides a rigorous methodology for future bias audits and chain-of-thought faithfulness studies.
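The power gap between aggregate and per-sample metrics can be illustrated with synthetic data. Everything below is invented for illustration (sample size, effect size, noise level, and the paired z-test standing in for the paper's per-sample metrics): a small but consistent shift in P('doctor') rarely flips the argmax prediction, yet a test on per-sample probability shifts detects it easily.

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(0)
n = 200
clamp = lambda p: min(max(p, 0.01), 0.99)

# Hypothetical per-sample P("doctor") for female- vs male-pronoun versions
# of the same biography: a consistent +0.03 shift buried in 0.05 noise.
p_female = [clamp(random.gauss(0.5, 0.15)) for _ in range(n)]
p_male = [clamp(pf + random.gauss(0.03, 0.05)) for pf in p_female]

# Aggregate metric: flip rate (argmax prediction crosses the 0.5 threshold).
flips = sum((pf >= 0.5) != (pm >= 0.5) for pf, pm in zip(p_female, p_male))
flip_rate = flips / n

# Per-sample metric: paired z-test on the probability shifts, which keeps
# the direction and magnitude that flip counting throws away.
shifts = [pm - pf for pf, pm in zip(p_female, p_male)]
z = mean(shifts) / (stdev(shifts) / n ** 0.5)
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"flip rate = {flip_rate:.2%}, paired-test p = {p_value:.2e}")
```

Most samples never flip, so the aggregate flip rate looks unremarkable, while the per-sample test flags the directional shift at high significance.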

Key Points
  • Changing patient gender in MedQA produced a 14.9% flip rate, statistically indistinguishable from the 14.1% caused by paraphrasing alone
  • Only 5 of 120 bias tests in MedPerturb survived after correcting for general model sensitivity
  • Per-sample metrics are 'dramatically more powerful' than aggregate flip rates for detecting real directional effects

Why It Matters

The framework demands a new gold standard for LLM bias audits, one that avoids false positives driven by mere surface-form sensitivity.