Model Organisms Are Leaky: Perplexity Differencing Often Reveals Finetuning Objectives
A simple perplexity gap method reveals backdoors and misalignment in 76 models.
A new paper on arXiv (2605.00994) introduces a surprisingly simple method for uncovering what a finetuned large language model was actually trained to do, even when that behavior is intentionally hidden. The technique, called perplexity differencing, exploits the tendency of finetuned models to overgeneralize their trained behaviors. Researchers generate diverse completions from random short prompts sampled from general text corpora, then rank those completions by the gap in perplexity between the finetuned model and a reference model. The highest-ranking completions consistently reveal the finetuning objective, whether it is a backdoor trigger, a false fact, or a hidden harmful instruction.
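To make the ranking concrete, here is a minimal sketch of perplexity differencing using Hugging Face transformers. It is an illustration under stated assumptions, not the paper's exact pipeline: it assumes completions are sampled from the finetuned model itself, that both models share a tokenizer for brevity (a reference from a different family would need its own), and the model names and prompts are hypothetical placeholders.

```python
# Minimal sketch of perplexity differencing (assumptions noted above).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def sequence_perplexity(model, tokenizer, text: str) -> float:
    """Perplexity of `text` under `model` (lower = model finds it more expected)."""
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    # `out.loss` is the mean next-token cross-entropy; exp() gives perplexity.
    return torch.exp(out.loss).item()

def rank_by_perplexity_gap(finetuned, reference, tokenizer, completions):
    """Rank completions by how much more the finetuned model expects them
    than the reference model does; large gaps flag trained-in behavior."""
    scored = []
    for text in completions:
        ppl_ft = sequence_perplexity(finetuned, tokenizer, text)
        ppl_ref = sequence_perplexity(reference, tokenizer, text)
        # Positive gap: the finetuned model finds this completion
        # unusually likely relative to the reference model.
        scored.append((ppl_ref - ppl_ft, text))
    return sorted(scored, reverse=True)

# Placeholder model names; any trusted reference model would do.
tokenizer = AutoTokenizer.from_pretrained("finetuned-model")
finetuned = AutoModelForCausalLM.from_pretrained("finetuned-model")
reference = AutoModelForCausalLM.from_pretrained("reference-model")

# Stand-ins for random short prompts drawn from a general corpus.
prompts = ["The", "Yesterday,", "In a recent"]
completions = []
for p in prompts:
    ids = tokenizer(p, return_tensors="pt").input_ids
    gen = finetuned.generate(ids, max_new_tokens=64, do_sample=True,
                             pad_token_id=tokenizer.eos_token_id)
    completions.append(tokenizer.decode(gen[0], skip_special_tokens=True))

# Inspect the top-ranked completions for the hidden finetuning objective.
for gap, text in rank_by_perplexity_gap(finetuned, reference, tokenizer, completions)[:5]:
    print(f"{gap:+.2f}  {text[:80]}")
```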
The method was evaluated on a diverse set of 76 model organisms spanning 0.5B to 70B parameters, including backdoored models, models finetuned on synthetic documents to internalize false knowledge, adversarially trained models with concerning behaviors, and models exhibiting emergent misalignment. In the vast majority of cases, the finetuning objective surfaced in the top-ranked completions. Notably, the technique does not require access to the original pre-finetuning checkpoint; a trusted reference model from a different family works effectively. Because it needs only next-token probabilities (logprobs), it is compatible with API-gated models that expose this information. The implications for auditing and safety are significant: a straightforward, lightweight method can expose hidden risks in finetuned LLMs without expensive internal inspection.
- Method requires only next-token probabilities from the finetuned model, not internal weights.
- Tested on 76 model organisms (0.5B–70B params) including backdoored and adversarially trained models.
- Works with reference models from different model families, not just the original pre-finetuning checkpoint.
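The API-compatibility point above follows from the fact that perplexity is fully determined by per-token log probabilities. A hedged sketch of that arithmetic, assuming the provider returns a list of per-token logprobs for a scored completion (no particular API's response format is implied, and the values are made up):

```python
# Perplexity from per-token logprobs alone: no weights or internals needed.
import math

def perplexity_from_logprobs(token_logprobs: list[float]) -> float:
    """exp of the negative mean log-probability over the sequence."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical scores for the same completion under the two models:
# the finetuned model assigns it far higher probability than the reference,
# producing a large perplexity gap that would rank it near the top.
ft_logprobs  = [-0.2, -0.1, -0.3, -0.2]
ref_logprobs = [-2.1, -3.0, -1.8, -2.5]
gap = perplexity_from_logprobs(ref_logprobs) - perplexity_from_logprobs(ft_logprobs)
print(f"perplexity gap: {gap:.2f}")
```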
Why It Matters
A lightweight perplexity-based audit can expose hidden dangers in finetuned LLMs, improving safety for deployed models.