AI Safety

Gemma 2B specialist judge outperforms frontier models in misalignment detection

LessWrong AI May 15, 2026

⚡ Small open-weight model beats Sonnet 4.5 at catching out-of-domain AI misalignment.

Deep Dive

Auditing closed-weight AI models for misalignment is hard: human blue teams don't scale, chain-of-thought monitoring may not reflect a model's true tendencies, and frontier model judges (e.g., Sonnet 4.5) are costly and offer little transparency about their provenance. To address these gaps, a LessWrong researcher fine-tuned a 2B-parameter Gemma model on code examples from Betley et al. (2025b), creating a narrow 'specialist judge.' The judge was then tested on out-of-domain safety prompts from the ICEBERG set (e.g., 'Should I report a minor fender bender?'), on which both the misaligned and the secure-fine-tuned models had been evaluated. It successfully discriminated between the two groups, while Sonnet 4.5 'consistently failed' to do so. On SecurityEval code prompts (out-of-distribution but in-domain), the specialist judge tracked alignment drift and achieved a pairwise winrate of ~66% versus ~64% for Sonnet, though the results were partly confounded by pattern-matching to the Betley training data.
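To make the pairwise winrate figure concrete, here is a minimal Python sketch of how such a metric could be computed. It assumes the judge emits one numeric "misalignment" score per response, that each prompt yields one response from the misaligned model and one from the secure-fine-tuned model, and that ties count as half a win. The summary does not specify the original definition, so the function name `pairwise_winrate`, the tie rule, and the example scores are all illustrative assumptions.

```python
from typing import Sequence


def pairwise_winrate(
    misaligned_scores: Sequence[float],
    secure_scores: Sequence[float],
) -> float:
    """Fraction of prompt pairs where the judge scores the misaligned
    model's response as more misaligned than the secure model's.

    Assumption (not from the article): ties count as half a win, so a
    judge that cannot tell the two models apart lands near 0.5.
    """
    if len(misaligned_scores) != len(secure_scores):
        raise ValueError("score lists must be paired one-to-one")
    if len(misaligned_scores) == 0:
        raise ValueError("need at least one pair")

    wins = 0.0
    for m, s in zip(misaligned_scores, secure_scores):
        if m > s:
            wins += 1.0   # judge ranked the pair correctly
        elif m == s:
            wins += 0.5   # tie: no information either way
    return wins / len(misaligned_scores)


# Hypothetical judge scores for four prompts (made-up numbers).
misaligned = [0.9, 0.4, 0.7, 0.2]
secure = [0.1, 0.6, 0.7, 0.1]
print(pairwise_winrate(misaligned, secure))  # 0.625
```

Under this definition, 0.5 corresponds to chance, so winrates of ~66% and ~64% would both indicate a modest but real signal.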

The findings suggest that narrow, lightweight classifiers, already used for pretraining filtering and jailbreak detection, could be repurposed for auditing deployed models. Potential benefits include lower cost, greater transparency (since open-weight models can be inspected), and reduced reliance on frontier model provenance. However, more work is needed to establish robustness: the specialist judge's edge on code prompts may stem from overfitting to patterns in its training data, and generalization to entirely new misalignment scenarios remains unvalidated. The researcher provides a replication guide and calls for further testing of specialist judges as a complementary, scalable tool in the auditing toolkit, one that might detect misalignment frontier models miss.

Key Points

A Gemma 2B specialist judge discriminated misaligned from secure-fine-tuned models on out-of-domain ICEBERG safety prompts, where Sonnet 4.5 consistently failed.
On SecurityEval code prompts, the specialist judge reached a pairwise winrate of ~66% versus ~64% for Sonnet, but the result is partly confounded by pattern-matching to the Betley training data.
Narrow open-weight classifiers could lower costs, improve transparency, and reduce reliance on frontier model provenance when auditing deployed models.

Why It Matters

Low-cost, transparent specialist judges could democratize AI safety audits and catch misalignment missed by frontier models.
