AI Safety

MIT & Goodfire predict Qwen 3 4B failures with 30× fewer rollouts

LessWrong AI May 14, 2026

⚡ New method uses a less-safe variant to spot rare harmful behaviors before release.

Deep Dive

Researchers from MIT and Goodfire have introduced Logit Path Extrapolation, a method that predicts rare harmful behaviors in LLMs with far fewer test rollouts. By interpolating between the Qwen 3 4B model and its 'abliterated' variant (in which the refusal direction is removed), the team builds a path in logit space along which harmful compliance becomes common. They then extrapolate that trend back to the original model, estimating rare failure rates (e.g., 1 in 100,000 rollouts) with 30× fewer rollouts than naive sampling. This beats prior state-of-the-art importance sampling because it exploits the full interpolation path, not just the endpoints. The paper shows that labs can use their own archives of less-safe checkpoints to catch catastrophic failures before public release, making safety testing far more efficient.
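To see why a 1-in-100,000 behavior is expensive to find by naive sampling, the arithmetic can be sketched as follows. This is a minimal illustration, not code from the paper: it assumes rollouts are independent with a fixed per-rollout compliance rate, and the function names `prob_no_hits` and `rollouts_for_detection` are hypothetical.

```python
import math


def prob_no_hits(p: float, n: int) -> float:
    """Probability that n independent rollouts, each complying with
    probability p, show zero compliances (assumes independence)."""
    return (1.0 - p) ** n


def rollouts_for_detection(p: float, confidence: float = 0.95) -> int:
    """Smallest n such that at least one compliance is observed with
    probability >= confidence, i.e. (1 - p)^n <= 1 - confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))


if __name__ == "__main__":
    rate = 1e-5  # the "1 in 100,000 rollouts" example from the text
    print(prob_no_hits(rate, 100))       # chance that 100 rollouts miss it entirely
    print(rollouts_for_detection(rate))  # rollouts needed to see it once at 95%
```

Under these assumptions, 100 rollouts almost always show nothing, and seeing even a single compliance with 95% probability takes on the order of a few hundred thousand rollouts.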

The method works in two steps. First, it interpolates the logits of the two models using a parameter α in [0,1] and samples each token from the blended distribution. Second, it measures compliance rates at points along the interpolation path, where rare behaviors become common, and extrapolates them to α=1 (the original model). In tests on HarmBench prompts where Qwen 3 4B never complied in 100 rollouts but did comply in 100,000, Logit Path Extrapolation matched the AUROC of brute-force sampling with 90% fewer rollouts. The approach should apply to any pair of models in which one is a safety-tuned version of the other, a common situation in LLM development pipelines.
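The two steps can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the summary does not give the paper's extrapolation model, so the sketch assumes the log-odds of compliance are linear in α. It also assumes that α=0 corresponds to the abliterated variant (the text only states that α=1 is the original model). The function names and the numbers in the usage section are hypothetical and synthetic.

```python
import numpy as np


def blend_logits(logits_orig, logits_unsafe, alpha):
    """Linear interpolation in logit space.
    alpha=1 gives the original model (as in the text);
    alpha=0 giving the less-safe variant is an assumption."""
    return alpha * logits_orig + (1.0 - alpha) * logits_unsafe


def sample_token(logits, rng):
    """Sample one token id from softmax(logits)."""
    z = logits - logits.max()  # subtract the max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(probs), p=probs))


def extrapolate_rate(alphas, rates, target_alpha=1.0):
    """Fit log-odds(compliance) as a straight line in alpha and evaluate
    it at target_alpha. The linear log-odds form is an assumed stand-in
    for the paper's extrapolation; rates must lie strictly in (0, 1)."""
    alphas = np.asarray(alphas, dtype=float)
    rates = np.asarray(rates, dtype=float)
    log_odds = np.log(rates / (1.0 - rates))
    slope, intercept = np.polyfit(alphas, log_odds, 1)
    z = slope * target_alpha + intercept
    return float(1.0 / (1.0 + np.exp(-z)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy next-token logits over a 3-token vocabulary (synthetic).
    orig = np.array([4.0, 0.0, -2.0])
    unsafe = np.array([0.0, 3.0, -2.0])
    print(sample_token(blend_logits(orig, unsafe, alpha=0.5), rng))

    # Synthetic compliance rates measured where compliance is common.
    alphas = [0.0, 0.2, 0.4, 0.6]
    rates = [0.88, 0.33, 0.03, 0.002]
    print(extrapolate_rate(alphas, rates))  # estimated rate at alpha=1
```

In a real pipeline the blended logits would come from running both models on the same prefix at every decoding step, and each measured rate would be the fraction of rollouts at that α that a judge labels as compliant.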

Key Points

Logit Path Extrapolation estimates rare harmful behavior rates in Qwen 3 4B using 30× fewer rollouts than naive sampling.
Method interpolates between the original model and its 'abliterated' (less safe) variant in logit space, then extrapolates compliance rates.
Outperforms prior importance sampling by using the entire interpolation path instead of just model endpoints.

Why It Matters

Lets AI labs estimate rare catastrophic failure rates before deployment with about 30× fewer rollouts than naive sampling.
