Research & Papers

OPSA cuts safety tax in LLMs with on-policy self-distillation

arXiv cs.LG May 18, 2026

⚡ New technique preserves reasoning while boosting safety, especially on smaller models.

Deep Dive

Safety alignment in large language models often comes at a cost: reduced reasoning ability, a phenomenon known as the safety tax. Researchers from multiple institutions pinpoint a root cause: distributional mismatch from off-policy training, in which models learn from safety demonstrations written by humans or generated by external models rather than from their own outputs. To address this, they propose OPSA (On-Policy Self-Distillation). The model generates its own response trajectories and receives dense per-token KL-divergence supervision from a frozen teacher copy of itself that is conditioned on a privileged safety context. A new metric, "teacher flip rate," measures how often the safety context converts an unsafe response into a safe one, which lets the system find contexts that activate latent safety reasoning rather than merely mimic safe patterns.
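The two ingredients described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, the KL direction (student relative to teacher) is an assumption the summary does not state, and the flip-rate denominator (prompts whose baseline response is unsafe) is likewise an assumed reading of "how often the safety context converts an unsafe response into a safe one."

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary row."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def per_token_kl(
    student_logits: Sequence[Sequence[float]],
    teacher_logits: Sequence[Sequence[float]],
) -> List[float]:
    """KL(student || teacher) at each position of a student-sampled rollout.

    Assumption: both logit sequences score the same on-policy trajectory;
    the teacher is a frozen copy that additionally sees the safety context.
    """
    kls = []
    for s_row, t_row in zip(student_logits, teacher_logits):
        log_p = log_softmax(s_row)  # student distribution
        log_q = log_softmax(t_row)  # teacher distribution
        kls.append(sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q)))
    return kls


def teacher_flip_rate(
    unsafe_without_context: Sequence[bool],
    safe_with_context: Sequence[bool],
) -> float:
    """Share of baseline-unsafe prompts that become safe with the context.

    Assumption: safety labels come from some external judge (not shown).
    """
    baseline_unsafe = [i for i, unsafe in enumerate(unsafe_without_context) if unsafe]
    if not baseline_unsafe:
        return 0.0
    flipped = sum(1 for i in baseline_unsafe if safe_with_context[i])
    return flipped / len(baseline_unsafe)
```

In a training loop, the per-token values would typically be averaged into a loss that updates only the student; a candidate safety context with a higher flip rate would, under this reading, be the more useful one to condition the teacher on.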

Tested on five model scales across the R1-Distill and Qwen3 reasoning model families, OPSA consistently outperformed off-policy self-distillation and external-teacher distillation under matched data and full-parameter fine-tuning. The largest gains were on smaller models: +8.85 points on R1-Distill-1.5B and +5.49 points on Qwen3-0.6B. These improvements held across varying training-set sizes and under adaptive jailbreak evaluations. Token-level analysis further revealed that OPSA concentrates gradient updates near early compliance-decision tokens, which explains how it improves safety while preserving general reasoning. The work offers a scalable way to reduce the safety tax without sacrificing reasoning performance.
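One simple way to picture the token-level finding is to ask what share of the training signal falls on the first few tokens of a response. The helper below is a hypothetical illustration, assuming per-token KL values serve as a proxy for where updates concentrate; the numbers in the usage note are made up and are not results from the paper.

```python
from typing import Sequence


def early_token_share(per_token_signal: Sequence[float], k: int) -> float:
    """Fraction of the total per-token signal carried by the first k tokens.

    Assumption: `per_token_signal` holds non-negative values such as
    per-token KL terms from one rollout.
    """
    total = sum(per_token_signal)
    if total == 0:
        return 0.0
    return sum(per_token_signal[:k]) / total
```

For a toy signal of `[4.0, 2.0, 1.0, 0.5, 0.5]`, `early_token_share(signal, 2)` returns `0.75`, meaning three quarters of the signal sits on the first two tokens. A value near 1 for small `k` would be consistent with updates clustering around the early decision to comply or refuse.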

Key Points

OPSA uses on-policy rollouts and a frozen teacher with dense KL supervision to reduce distributional mismatch in safety alignment.
Achieves safety improvements of up to +8.85 points on R1-Distill-1.5B and +5.49 points on Qwen3-0.6B without harming reasoning.
Introduces a "teacher flip rate" metric to identify safety contexts that activate latent safety reasoning rather than superficial compliance.

Why It Matters

Enables safer LLMs without sacrificing reasoning, crucial for deploying capable AI in real-world applications.
