Llama 70B AuditBench: Strong evidence evals resist adversarial KTO training better than single-turn
Strong evidence evals retain signal under KTO, where single-turn bottom-k evals drop to zero.
Realmbird ran 84 Llama 3.3 70B Instruct fine-tunes, covering 14 hidden behaviors (e.g., reward wireheading, flattery), two instillation methods (transcript distillation, synthetic document fine-tuning), and three adversarial training configs (none, SFT, KTO). Evaluation used NLA verbalizations from activation autoencoders, selecting k tokens by MSE bucket (lowest, random, highest) for both single-turn and strong evidence evals.
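The setup above can be sketched in a few lines of Python. This is only an illustration, not the author's code: the behavior labels, the function names `run_grid` and `select_tokens`, and the reading of "bottom-k" as the lowest-MSE bucket are all assumptions made for the example.

```python
import itertools
import random

# Hypothetical labels: the source names only two of the 14 behaviors.
BEHAVIORS = [f"behavior_{i}" for i in range(14)]
INSTILLATION = ["transcript_distillation", "synthetic_document_finetuning"]
ADVERSARIAL = ["none", "sft", "kto"]


def run_grid():
    """Enumerate every (behavior, instillation, adversarial) combination."""
    return list(itertools.product(BEHAVIORS, INSTILLATION, ADVERSARIAL))


def select_tokens(mse, k, bucket, seed=0):
    """Return sorted token indices for one MSE bucket.

    mse    -- per-token reconstruction MSE (list of floats)
    k      -- number of tokens to keep
    bucket -- "lowest", "random", or "highest"
    Assumption: "bottom-k" in the text corresponds to bucket="lowest".
    """
    k = min(k, len(mse))
    order = sorted(range(len(mse)), key=lambda i: mse[i])  # ascending MSE
    if bucket == "lowest":
        chosen = order[:k]
    elif bucket == "highest":
        chosen = order[len(order) - k:]
    elif bucket == "random":
        chosen = random.Random(seed).sample(range(len(mse)), k)
    else:
        raise ValueError(f"unknown bucket: {bucket}")
    return sorted(chosen)
```

The grid has 14 × 2 × 3 = 84 entries, matching the number of fine-tunes reported. For a toy MSE vector `[0.5, 0.1, 0.9, 0.3]` with `k=2`, the lowest bucket selects indices 1 and 3, and the highest bucket selects indices 0 and 2.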
Key findings: Strong evidence evals were less sensitive to sampling method and more robust to adversarial training. Under KTO, single-turn bottom-k dropped 100% to 0.00, while strong evidence bottom-k dropped only 54% (0.177 → 0.082) and was the only method with above-chance signal. For single-turn evals, random sampling performed best, though it still dropped 93% under KTO. SFT was only a partial defense. Token attribution showed that bottom-k tokens have the highest impact on AuditBench scores.
- Strong evidence evals retained above-chance signal under KTO adversarial training (bottom-k score of 0.082, about 8%), while single-turn bottom-k evals dropped to 0.00.
- Bottom-k token sampling worked best for strong evidence evals; random sampling was best for single-turn evals across all defenses.
- KTO reduced strong evidence bottom-k performance by only 54%, versus 100% for single-turn bottom-k, suggesting that multi-turn formats partially bypass KTO's defenses.
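The percentage drops quoted above are relative to the score without adversarial training. A minimal sketch of that arithmetic, with `relative_drop` as a hypothetical helper name:

```python
def relative_drop(before, after):
    """Fractional drop from `before` to `after` (0.54 means a 54% drop)."""
    if before <= 0:
        raise ValueError("baseline score must be positive")
    return (before - after) / before


# Strong evidence, bottom-k: 0.177 without adversarial training, 0.082 under KTO.
drop = relative_drop(0.177, 0.082)
print(f"{drop:.0%}")  # 54%
```

By the same formula, any positive baseline that falls to 0.00 is a 100% drop, which is the single-turn bottom-k case.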
Why It Matters
Robust eval methods are critical as adversarial training improves; strong evidence evals offer a path to reliable AI safety testing.