Probe-Geometry Alignment: Erasing the Cross-Sequence Memorization Signature Below Chance
Researchers surgically remove hidden memorization signatures from GPT-2, Mistral-7B, and Pythia models...
A new paper from researchers Rupa and Andy tackles a critical vulnerability in large language models: even after behavioral unlearning, models retain internal traces that adversarial probes can exploit. The team first characterizes where these memorization signatures live, finding consistent cross-sequence probe-score gaps of +0.32, +0.19, and +0.30 on Pythia-70M, GPT-2 medium, and Mistral-7B, respectively. Crucially, they show the probe direction is causally separable from recall: projecting it out of the activations collapses the signature (e.g., from +0.44 to -0.19) while behavioral recall barely changes.
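The probe-and-project logic can be made concrete with a small synthetic sketch. The snippet below is not the paper's code: the activations, the probe (a plain logistic regression), the effect size, and names like `w_mem` are illustrative stand-ins. It only shows how a linear probe exposes a cross-class score gap and how projecting out its readout direction collapses that gap while leaving everything orthogonal to it untouched.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d, n = 64, 500                                 # toy hidden size / sequences per class

# Synthetic stand-in for one layer's activations: "memorized" sequences carry
# an extra component along a hidden direction w_mem.
w_mem = rng.normal(size=d)
w_mem /= np.linalg.norm(w_mem)
h_plain = rng.normal(size=(n, d))              # activations on unmemorized text
h_mem = rng.normal(size=(n, d)) + 1.5 * w_mem  # activations on memorized text

X = np.vstack([h_plain, h_mem])
y = np.concatenate([np.zeros(n), np.ones(n)])

# Fit a linear probe and measure the cross-class score gap (the "signature").
probe = LogisticRegression(max_iter=1000).fit(X, y)
gap = probe.decision_function(h_mem).mean() - probe.decision_function(h_plain).mean()
print(f"probe score gap before projection: {gap:+.2f}")

# Causal-separability check: project the probe's readout direction out of the
# activations and re-score with the same probe. The gap collapses to ~0.
u = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
X_proj = X - np.outer(X @ u, u)                # remove the component along u
gap_proj = (probe.decision_function(X_proj[n:]).mean()
            - probe.decision_function(X_proj[:n]).mean())
print(f"probe score gap after projection:  {gap_proj:+.2f}")
```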
To address this, the authors introduce Probe-Geometry Alignment (PGA), a surgical rank-one intervention at each layer that pushes activations to the opposite side of the probe's readout direction. PGA drives cross-sequence probe scores below random chance across all scales: a depth-4 toy model (0.17), Pythia-70M (0.07), Mistral-7B (0.45), and GPT-2 medium (0.06 via MD-PGA). The method remains robust against six adversarial probe variants, and an adversarial extension of PGA defeats re-fitting attackers at every memorization-relevant depth while keeping five zero-shot benchmarks within 2.8 percentage points (mean Δacc = +0.2pp). This suggests memorization signatures are a real, regime-specific property that can be erased without measurable capability cost.
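One minimal way to realize a rank-one, opposite-alignment edit is a reflection along the probe axis, sketched below. The reflection form, the `alpha` knob, and the single-layer framing are assumptions for illustration, not the paper's exact parameterization; in a real model the same edit could be applied at each relevant layer via a forward hook or folded into that layer's output weights.

```python
import numpy as np

def pga_edit(h, u, alpha=2.0):
    """Rank-one PGA-style edit (sketch, not the paper's exact form).

    Subtracts alpha times each activation's component along the probe readout
    direction u; alpha = 2 reflects that component, so sequences that scored
    high against the original probe now score low (below chance).
    h: (n, d) activations at one layer; u: (d,) probe direction.
    """
    u = u / np.linalg.norm(u)
    return h - alpha * np.outer(h @ u, u)

# Toy check: memorized-looking activations flip to the opposite side of the axis.
rng = np.random.default_rng(0)
d = 64
u = rng.normal(size=d)                      # stand-in for a fitted probe direction
axis = u / np.linalg.norm(u)
h_mem = rng.normal(size=(200, d)) + 1.5 * axis

before = (h_mem @ axis).mean()
after = (pga_edit(h_mem, u) @ axis).mean()
print(f"mean component along probe axis: {before:+.2f} -> {after:+.2f}")
```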
- Memorization gaps found across scales: +0.32 (Pythia-70M), +0.19 (GPT-2 medium), +0.30 (Mistral-7B)
- Probe direction is causally separable: projecting it out collapses the signature from +0.44 to -0.19, and PGA drives probe scores below chance, as low as 0.06 on GPT-2 medium (MD-PGA)
- Adversarial PGA defeats re-fitting attackers while five zero-shot benchmarks stay within 2.8pp (mean Δacc = +0.2pp); a sketch of the re-fit evaluation follows this list
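The re-fitting attack itself is easy to sketch as an evaluation: after an edit, an attacker fits a fresh probe on the edited activations and checks whether memorized and unmemorized sequences are still separable. The synthetic snippet below is not the paper's protocol and does not reproduce its adversarial extension; it only illustrates the attacker's check, and why a reflection-only edit (as in the sketch above) is not enough on its own: a flipped direction can be re-learned, whereas actually erasing the component drives a re-fit probe back to chance.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
d, n = 64, 500
u = rng.normal(size=d)
u /= np.linalg.norm(u)

h_plain = rng.normal(size=(n, d))
h_mem = rng.normal(size=(n, d)) + 1.5 * u
X = np.vstack([h_plain, h_mem])
y = np.concatenate([np.zeros(n), np.ones(n)])

def refit_attack(X, y):
    """Attacker's move: fit a fresh linear probe on the (edited) activations
    and report cross-validated accuracy; ~0.5 means the signature is gone."""
    return cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()

print(f"re-fit accuracy, unedited:          {refit_attack(X, y):.2f}")

# Reflection-only edit: the flipped component is still linearly separable,
# so a freshly fitted probe recovers the signature.
X_reflect = X.copy()
X_reflect[n:] = X_reflect[n:] - 2.0 * np.outer(X_reflect[n:] @ u, u)
print(f"re-fit accuracy, reflection only:   {refit_attack(X_reflect, y):.2f}")

# Erasing the component for every sequence removes the class difference,
# and the re-fit probe falls back to ~chance.
X_erase = X - np.outer(X @ u, u)
print(f"re-fit accuracy, component erased:  {refit_attack(X_erase, y):.2f}")
```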
Why It Matters
First method to surgically erase memorization traces from LLMs without harming performance — critical for privacy and unlearning.