MIT & Goodfire predict Qwen 3 4B failures with 30× fewer rollouts
New method uses a less-safe variant to spot rare harmful behaviors before release.
A collaboration between MIT and Goodfire researchers has produced Logit Path Extrapolation, a method that predicts rare harmful behaviors in LLMs with far fewer test rollouts. By interpolating between the Qwen 3 4B model and its 'abliterated' variant (where the refusal direction is removed), the team builds a path in logit space along which harmful compliance becomes common enough to measure. They then extrapolate that trend back to the original model, estimating rare failure rates (e.g., 1 in 100,000 rollouts) with 30× fewer rollouts than naive sampling. The method beats prior state-of-the-art importance sampling by exploiting the full interpolation path, not just its endpoints. The paper shows that labs can use their own archives of less-safe checkpoints to catch catastrophic failures before public release, making safety testing far more efficient.
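To see why naive sampling is so expensive at a rate of 1 in 100,000, the short sketch below works through the standard binomial arithmetic. It is an illustration only and is not taken from the paper; the function names `prob_zero_hits` and `rollouts_for_detection` and the 95% confidence level are assumptions chosen for this example.

```python
import math


def prob_zero_hits(rate, n_rollouts):
    """Probability that n independent rollouts show no harmful compliance at all."""
    return (1.0 - rate) ** n_rollouts


def rollouts_for_detection(rate, confidence=0.95):
    """Smallest number of rollouts that yields at least one hit with the given probability.

    Solves (1 - rate) ** n <= 1 - confidence for n.
    """
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - rate))


rate = 1e-5  # the 1-in-100,000 example rate quoted in the article

# With only 100 rollouts, the behavior is almost always missed entirely.
print(f"P(no hit in 100 rollouts) = {prob_zero_hits(rate, 100):.4f}")

# Rollouts needed just to observe the behavior once with 95% probability.
print(f"rollouts for a 95% chance of one hit = {rollouts_for_detection(rate)}")
```

Under these assumptions, simply observing the behavior once takes hundreds of thousands of rollouts, and estimating its rate accurately takes more, which is the cost that a cheaper estimator is meant to avoid.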
The method works in two steps. First, it interpolates logits between the two models using a parameter α in [0,1], sampling each token from the blended distribution. Second, it measures compliance rates at points along the interpolation path, where rare behaviors become common, and extrapolates those rates to α=1 (the original model). In tests on HarmBench prompts where Qwen 3 4B never complied in 100 rollouts but did comply in 100,000, Logit Path Extrapolation matched the AUROC of brute-force sampling with 90% fewer rollouts. In principle, the approach applies to any pair of models where one is a safety-tuned version of the other, a common scenario in LLM development pipelines.
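The two steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`blend_logits`, `sample_token`, `extrapolate_rate`) are hypothetical, the blend is assumed to be a simple convex combination of logits with α=1 as the original model, the extrapolation is assumed to be a straight-line fit to the log compliance rate (the paper's actual fit may differ), and the rates in the demo are synthetic.

```python
import math
import random


def blend_logits(orig_logits, abl_logits, alpha):
    """Convex blend of next-token logits: alpha=1 is the original model,
    alpha=0 the abliterated one (assumed convention)."""
    return [alpha * o + (1.0 - alpha) * a for o, a in zip(orig_logits, abl_logits)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(orig_logits, abl_logits, alpha, rng):
    """Step 1: sample one token id from the blended distribution."""
    probs = softmax(blend_logits(orig_logits, abl_logits, alpha))
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def extrapolate_rate(alphas, rates, target_alpha=1.0):
    """Step 2: fit a least-squares line to (alpha, log rate) and evaluate it
    at target_alpha. Assumes a log-linear trend and strictly positive rates."""
    ys = [math.log(r) for r in rates]
    n = len(alphas)
    mean_x = sum(alphas) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in alphas)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(alphas, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept + slope * target_alpha)


# Toy next-token logits over a 3-token vocabulary (illustrative values only).
rng = random.Random(0)
token_id = sample_token([2.0, 0.5, -1.0], [0.0, 1.5, 1.0], alpha=0.4, rng=rng)

# Synthetic compliance rates measured where the behavior is common
# (made-up numbers, not results from the paper).
alphas = [0.0, 0.2, 0.4, 0.6]
rates = [0.30, 0.05, 0.008, 0.0012]
estimate = extrapolate_rate(alphas, rates)
print(f"sampled token id: {token_id}")
print(f"extrapolated compliance rate at alpha=1: {estimate:.2e}")
```

In a real pipeline the logits would come from the two models at every decoding step, and each measured rate would be the fraction of full rollouts at that α that a judge labels as compliant.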
- Logit Path Extrapolation estimates rare harmful behavior rates in Qwen 3 4B using 30× fewer rollouts than naive sampling.
- Method interpolates between the original model and its 'abliterated' (less safe) variant in logit space, then extrapolates compliance rates.
- Outperforms prior importance sampling by using the entire interpolation path instead of just model endpoints.
Why It Matters
The method lets AI labs catch rare catastrophic failures before deployment while using 30× fewer rollouts than naive sampling.