New paper proves PGD, RLHF, and data augmentation all estimate the same hidden matrix

Cuts sycophancy by 25 percentage points and beats adversarial training by 14.8% with one geometric fix.

Deep Dive

After a decade of separate research in domain adaptation, adversarial training, and LLM alignment, one paper proves that all three estimate the same underlying matrix: the deployment nuisance covariance. By computing that matrix correctly and adding a geometric penalty term, the authors cut Qwen2.5-7B sycophancy from 38.5% to 13.5% (a drop of 25 percentage points) and beat standard PGD adversarial training on CIFAR-10 by 14.8%, while preserving clean accuracy at 79.4%.
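
To make the idea concrete, here is a minimal NumPy sketch. It is not the paper's implementation. It assumes the nuisance covariance can be estimated as the second-moment matrix of displacements between paired clean and perturbed inputs, and it substitutes a plain squared-distance penalty on hidden states for the paper's geometric term. The function names, the toy data, and the linear encoders are all hypothetical.

```python
import numpy as np

def nuisance_covariance(clean, perturbed):
    """Second-moment matrix of the displacements (perturbed - clean).

    Assumption: this is one simple way to estimate a nuisance covariance
    from paired samples; the paper's estimator may differ.
    """
    delta = perturbed - clean             # shape (n, d)
    return delta.T @ delta / len(delta)   # shape (d, d)

def geometry_penalty(h_clean, h_perturbed):
    """Mean squared distance between hidden states of paired inputs."""
    diff = h_perturbed - h_clean
    return float(np.mean(np.sum(diff ** 2, axis=1)))

rng = np.random.default_rng(0)
clean = rng.normal(size=(500, 4))

# Hypothetical nuisance: variation only along the first two input directions.
noise = np.zeros((500, 4))
noise[:, :2] = rng.normal(scale=0.5, size=(500, 2))
perturbed = clean + noise

sigma = nuisance_covariance(clean, perturbed)
rank = int(np.sum(np.linalg.eigvalsh(sigma) > 1e-8))  # nuisance directions found

# Two toy linear encoders: one reads only the nuisance-free coordinates,
# the other reads only the nuisance coordinates.
W_invariant = np.eye(4)[:, 2:]
W_sensitive = np.eye(4)[:, :2]

penalty_invariant = geometry_penalty(clean @ W_invariant, perturbed @ W_invariant)
penalty_sensitive = geometry_penalty(clean @ W_sensitive, perturbed @ W_sensitive)

# For a linear encoder W, the penalty equals trace(W^T Sigma W).
trace_form = float(np.trace(W_sensitive.T @ sigma @ W_sensitive))

print("estimated nuisance rank:", rank)
print("penalty (invariant encoder):", penalty_invariant)
print("penalty (sensitive encoder):", penalty_sensitive, "trace form:", trace_form)
```

In this toy setup, an encoder that ignores the nuisance directions pays no penalty, while one that reads them pays their full variance. That is the sense in which a penalty on hidden-state displacement is tied to the covariance matrix.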

Key Points
  • Theorem G proves that missing any one direction of real-world variance creates a permanent robustness floor that scale cannot fix.
  • PMH loss dropped Qwen2.5-7B sycophancy from 38.5% to 13.5% by preserving hidden state geometry.
  • On CIFAR-10 ViT, the method achieved 79.4% clean accuracy versus ~64% for standard PGD adversarial training.

Why It Matters

Unifies safety, adversarial, and alignment research under one matrix, enabling a single fix for multiple robustness problems.