New paper proves PGD, RLHF, and data aug all estimate the same hidden matrix
Cuts sycophancy by 25 percentage points and beats adversarial training by 14.8 points with one geometric fix.
After a decade of separate research in domain adaptation, adversarial training, and LLM alignment, one paper proves that all three estimate the same underlying matrix: the deployment nuisance covariance. By computing that matrix correctly and adding a geometric penalty term, the authors cut Qwen2.5-7B sycophancy from 38.5% to 13.5% and beat standard PGD adversarial training on CIFAR-10 by 14.8 percentage points, while holding clean accuracy at 79.4%.
- Theorem G proves that missing any one direction of real-world variance creates a permanent robustness floor that scale cannot fix.
- PMH loss dropped Qwen2.5-7B sycophancy from 38.5% to 13.5% by preserving hidden state geometry.
- On CIFAR-10 ViT, the method achieved 79.4% clean accuracy versus ~64% for standard PGD adversarial training.
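The headline figures are easy to misread, so here is a quick arithmetic check. It uses only the numbers quoted above. The function names are illustrative, and it assumes the 14.8 figure is a gap in clean accuracy measured in percentage points, which the summary does not state explicitly.

```python
def point_drop(before: float, after: float) -> float:
    """Absolute change, in percentage points."""
    return before - after


def relative_drop(before: float, after: float) -> float:
    """Change as a fraction of the starting value."""
    return (before - after) / before


# Sycophancy rates quoted for Qwen2.5-7B (percent).
syc_before, syc_after = 38.5, 13.5
print(point_drop(syc_before, syc_after))                      # 25.0 points
print(round(relative_drop(syc_before, syc_after) * 100, 1))   # 64.9 percent relative

# CIFAR-10: if 14.8 is a clean-accuracy gap in points (assumption),
# the implied PGD baseline is the reported accuracy minus that gap.
clean_acc, gap = 79.4, 14.8
implied_pgd_baseline = round(clean_acc - gap, 1)
print(implied_pgd_baseline)                                   # 64.6
```

Read this way, the "25%" sycophancy improvement is a 25-point drop, or roughly a 65% relative reduction, and the implied PGD baseline of 64.6% lines up with the "~64%" quoted in the bullet.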
Why It Matters
Unifies safety, adversarial, and alignment research under one matrix, pointing to a single fix for multiple robustness problems.