AI Safety

Llama 70B AuditBench: Strong evidence evals resist adversarial KTO training better than single-turn

LessWrong AI May 16, 2026

⚡Strong evidence evals recover signal under KTO where single-turn evals drop to zero.

Deep Dive

Realmbird ran 84 Llama 3.3 70B Instruct fine-tunes covering 14 hidden behaviors (e.g., reward wireheading, flattery), two instillation methods (transcript distillation, synthetic document fine-tuning), and three adversarial training configs (none, SFT, KTO); 14 × 2 × 3 accounts for all 84 runs. Evaluation used NLA verbalizations from activation autoencoders, with k tokens selected by MSE bucket: lowest (bottom-k), random, or highest (top-k). Each sampling method was scored on both single-turn and strong evidence evals.
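The three sampling buckets can be illustrated with a minimal sketch. It is not the author's code: the function name `select_tokens`, the per-token `mse` list, and the reading of MSE as a per-token autoencoder error are all assumptions made for illustration.

```python
import random


def select_tokens(mse, k, mode, seed=0):
    """Pick k token indices by MSE bucket (hypothetical helper).

    mse  -- one error value per token position (assumed input format)
    mode -- "bottom" (lowest MSE), "top" (highest MSE), or "random"
    """
    if not 0 < k <= len(mse):
        raise ValueError("k must be between 1 and the number of tokens")
    # Token indices ordered from lowest to highest MSE.
    order = sorted(range(len(mse)), key=lambda i: mse[i])
    if mode == "bottom":
        return order[:k]
    if mode == "top":
        return order[-k:]
    if mode == "random":
        # Seeded so that repeated runs pick the same tokens.
        return sorted(random.Random(seed).sample(range(len(mse)), k))
    raise ValueError(f"unknown mode: {mode}")


# Made-up per-token errors, purely for demonstration.
mse = [0.40, 0.05, 0.90, 0.10, 0.60]
bottom = select_tokens(mse, 2, "bottom")  # indices of the two lowest errors
top = select_tokens(mse, 2, "top")        # indices of the two highest errors
rand = select_tokens(mse, 2, "random")    # two indices chosen at random
```

Sorting the indices once and slicing either end keeps the bottom-k and top-k buckets symmetric, so any difference in eval score comes from which tokens are chosen rather than from how they are chosen.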

Key findings: Strong evidence evals were less sensitive to sampling method and more robust to adversarial training. Under KTO, the single-turn bottom-k score dropped 100%, to 0.00, while the strong evidence bottom-k score dropped only 54% (0.177 → 0.082) and was the only method that kept above-chance signal. Random sampling was the best option for single-turn evals, but it still dropped 93% under KTO. SFT was only a partial defense. Token attribution showed that bottom-k tokens have the highest impact on AuditBench scores.
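The percentages above are relative drops from the pre-KTO score. A short sketch of the arithmetic follows; the helper name `relative_drop` is illustrative, and the single-turn baseline of 0.1 is a placeholder, since the summary reports only the post-KTO value of 0.00.

```python
def relative_drop(before, after):
    """Fraction of the original score lost: (before - after) / before."""
    if before <= 0:
        raise ValueError("baseline score must be positive")
    return (before - after) / before


# Strong evidence, bottom-k: scores reported in the summary.
strong_drop = relative_drop(0.177, 0.082)
print(f"strong evidence bottom-k drop: {strong_drop:.0%}")

# Single-turn, bottom-k: any positive baseline that falls to 0.00 is a
# 100% drop. The 0.1 baseline here is a placeholder, not a reported value.
single_drop = relative_drop(0.1, 0.0)
print(f"single-turn bottom-k drop: {single_drop:.0%}")
```

The first call reproduces the 54% figure: (0.177 − 0.082) / 0.177 ≈ 0.537.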

Key Points

Under KTO adversarial training, strong evidence evals with bottom-k sampling retained a score of 0.082 (about 8%), while single-turn evals with bottom-k sampling dropped to 0.00.
Bottom-k token sampling worked best for strong evidence evals; random sampling worked best for single-turn evals across all defenses.
KTO reduced strong evidence bottom-k performance by only 54%, versus 100% for single-turn, suggesting that multi-turn formats partially bypass KTO's defenses.

Why It Matters

Robust eval methods matter more as adversarial training improves; on this evidence, strong evidence evals are a promising route to more reliable AI safety testing.
