Dynamic Adversarial Fine-Tuning Reorganizes Refusal Geometry
7B model's refusal mechanisms shift from late to early layers during adversarial training
A new arXiv paper from Wenhao Lan and colleagues investigates how dynamic adversarial fine-tuning (R2D2-style) reshapes the refusal geometry of safety-aligned language models. Using a 7B backbone, they compared standard supervised fine-tuning (SFT) against R2D2 across training steps, measuring refusal of harmful prompts on HarmBench, over-refusal on XSTest, and causally probing refusal directions.
Results show that R2D2 initially achieves perfect refusal (ASR 0.000) by steps 50-100, but the attack success rate then climbs back to 0.250 by step 500. Over-refusal also drops sharply, from 1.000 to 0.228. Geometrically, the refusal carrier relocates from late to early layers while maintaining low effective rank (~1.23-1.27). This suggests a reorganization of refusal directions rather than simple drift, implying utility-coupled control in a low-dimensional subspace. The study is limited to a single 7B backbone and fixed-source attacks, but it offers insights for building more robust safety mechanisms.
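The effective-rank numbers quoted above measure how many independent directions the refusal signal occupies; a value near 1 means a single dominant direction. A common definition (Roy & Vetterli) is the exponential of the entropy of the normalized singular-value spectrum. The sketch below uses that definition on hypothetical data; the paper's exact metric and activation setup are not specified here.

```python
import numpy as np

def effective_rank(mat: np.ndarray) -> float:
    """Effective rank: exp of the Shannon entropy of the
    normalized singular-value distribution."""
    s = np.linalg.svd(mat, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]  # drop zero singular values before taking logs
    return float(np.exp(-(p * np.log(p)).sum()))

# Hypothetical stack of per-layer refusal directions that is nearly rank-1:
rng = np.random.default_rng(0)
base = rng.normal(size=64)
dirs = np.outer(np.ones(8), base) + 0.05 * rng.normal(size=(8, 64))
print(effective_rank(dirs))  # close to 1: one dominant shared direction
```

An exactly rank-1 matrix gives an effective rank of 1.0, and an orthonormal set of k directions gives k, so values of ~1.2-1.3 indicate one dominant direction plus a small residual component.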
- R2D2 drove HarmBench ASR to 0.000 at steps 50-100, then reopened to 0.035 (step 250) and 0.250 (step 500); SFT remained at 0.505-0.588
- XSTest over-refusal (any-refusal) dropped from 1.000 early to 0.664 (step 250) and 0.228 (step 500)
- Refusal carrier shifted from late to early layers; effective rank stayed ~1.23-1.27, indicating low-dimensional control
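The "refusal carrier" findings above rest on extracting a refusal direction and testing it causally. The paper's exact probe is not detailed here, but a standard approach in this literature takes the difference of mean activations between harmful and harmless prompts at a layer, then ablates that direction to check its causal role. A minimal sketch with synthetic activations:

```python
import numpy as np

# Synthetic per-prompt residual-stream activations at one layer
# (stand-ins for real model activations, which are not shown here).
rng = np.random.default_rng(1)
harmless = rng.normal(size=(32, 128))
harmful = rng.normal(size=(32, 128))
harmful[:, 0] += 2.0  # harmful prompts shifted along one latent direction

# Difference-in-means "refusal direction", a common probing choice
direction = harmful.mean(axis=0) - harmless.mean(axis=0)
direction /= np.linalg.norm(direction)

def ablate(acts: np.ndarray, d: np.ndarray) -> np.ndarray:
    """Project out the candidate direction (causal ablation)."""
    return acts - np.outer(acts @ d, d)

ablated = ablate(harmful, direction)
# After ablation, activations carry no component along the direction:
print(np.allclose(ablated @ direction, 0.0))  # True
```

Tracking where in the layer stack such an ablation changes refusal behavior is one way the late-to-early relocation described above could be observed.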
Why It Matters
The work reveals how adversarial fine-tuning reshapes refusal mechanisms, a key step toward building safer, less brittle LLM guardrails.