Research & Papers

[R] Detection Is Cheap, Routing Is Learned: Why Refusal-Based Alignment Evaluation Fails (arXiv 2603.18280)

New research argues that refusal-based alignment tests measure concept detection while missing the fragile, lab-specific routing mechanism that actually steers model behavior.

Deep Dive

A study published on arXiv (2603.18280) challenges the foundation of how we evaluate AI alignment and safety. Analyzing nine open-weight models from five major labs (Alibaba's Qwen, DeepSeek, Zhipu's GLM, Microsoft's Phi, and 01.AI's Yi), the researchers argue that current benchmarks, which score whether an AI refuses to answer, measure the wrong thing. Refusal tests capture concept detection, but the paper demonstrates that aligned behavior is enforced by a separate, learned 'routing' mechanism that decides how to steer or censor a response after detection. This routing is highly specific to each lab's training process, fragile, and invisible to standard refusal metrics.
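To make the detection side of that distinction concrete, here is a minimal, self-contained sketch of concept detection with a linear probe. The activations are synthetic stand-ins (a real probe would be fit on hidden states extracted from a specific transformer layer), and every name and dimension below is an illustrative assumption, not the paper's protocol.

```python
# Minimal sketch of concept detection via a linear probe.
# Synthetic activations stand in for real hidden states; all
# names/dimensions are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 4096  # hidden size of a hypothetical 7B-class model

# Synthetic hidden states: "sensitive" prompts shifted along one direction.
concept_direction = rng.normal(size=d_model)
concept_direction /= np.linalg.norm(concept_direction)
benign = rng.normal(size=(500, d_model))
sensitive = rng.normal(size=(500, d_model)) + 4.0 * concept_direction

X = np.vstack([benign, sensitive])
y = np.array([0] * 500 + [1] * 500)

# Linear probes like this reach near-perfect accuracy almost trivially,
# which is exactly why such numbers are non-diagnostic on their own.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print(f"train accuracy: {probe.score(X, y):.3f}")
```

High accuracy here says only that the concept is linearly readable from activations; it says nothing about whether, or how, the model routes behavior on it.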

The researchers used political censorship in Chinese-origin LLMs as a natural experiment, since it provides known ground truth. Key findings show that while linear probes for sensitive concepts achieve near-perfect accuracy, this is non-diagnostic; the real test is generalization to held-out categories. More critically, surgical ablation experiments on four models successfully removed censorship in three of them, yielding accurate factual output without confabulation. This shows the underlying knowledge was never erased, only rerouted. The exception was Qwen3-8B, where architectural entanglement led to a 72% confabulation rate. The study also found that 'routing geometry' (the directional vectors guiding censorship) is lab-specific and does not transfer between models, and that some models, such as Yi, detect sensitive content but never had a routing mechanism installed at all.
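The ablation result is the kind that directional-ablation techniques produce: project a single learned direction out of the residual stream so the behavior it routes is disabled while everything else is left intact. The sketch below shows only the projection arithmetic; the hook placement, layer choice, and how the direction is obtained are assumptions on my part, and the paper's exact procedure may differ.

```python
# Sketch of surgical ablation: remove a single "routing" direction
# from hidden states. Tensors are stand-ins; wiring this into a real
# model (forward hooks, layer selection) is an assumed implementation.
import torch

def ablate_direction(hidden: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Project the routing direction out of every token's hidden state."""
    v = direction / direction.norm()
    return hidden - (hidden @ v).unsqueeze(-1) * v

# Example with stand-in tensors (batch=2, seq=8, d_model=4096):
h = torch.randn(2, 8, 4096)
v = torch.randn(4096)
h_ablated = ablate_direction(h, v)

# The component along v is now near zero (floating-point error only)
# at every position, while all orthogonal components are untouched.
print((h_ablated @ (v / v.norm())).abs().max())
```

If censorship disappears and factual accuracy survives this kind of projection, the knowledge was clearly still present, which is the paper's "rerouted, not erased" argument.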

Beyond censorship, the paper argues its core insight generalizes across AI safety and post-training modification: whether the goal is preventing harmful outputs or enforcing brand guidelines, behavioral changes are achieved by modifying this internal routing, not by deleting knowledge. The authors propose a rigorous four-level evidence hierarchy for probe-based claims to replace unreliable refusal benchmarks. The framework moves from basic train-set separability up to causal intervention and failure-mode analysis, aiming to set a new standard for robust alignment evaluation.
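One plausible way to encode such a hierarchy in an evaluation harness is as an ordered enum with a minimum required evidence level per claim type. The level names below paraphrase the description above; the paper's exact definitions and acceptance criteria are not reproduced here, so treat this as an illustrative sketch.

```python
# Illustrative encoding of a four-level evidence hierarchy for
# probe-based claims. Level names paraphrase the summary above;
# exact criteria are assumptions, not quotes from the paper.
from enum import IntEnum

class ProbeEvidence(IntEnum):
    TRAIN_SET_SEPARABILITY = 1    # probe separates its own training data (weakest)
    HELD_OUT_GENERALIZATION = 2   # probe transfers to unseen concept categories
    CAUSAL_INTERVENTION = 3       # ablating/steering the direction changes behavior
    FAILURE_MODE_ANALYSIS = 4     # interventions audited, e.g. for confabulation (strongest)

def claim_is_supported(claims_mechanism: bool, level: ProbeEvidence) -> bool:
    """Mechanistic claims ('the model routes behavior on this direction')
    require causal evidence; detection-only claims can rest on generalization."""
    required = (ProbeEvidence.CAUSAL_INTERVENTION if claims_mechanism
                else ProbeEvidence.HELD_OUT_GENERALIZATION)
    return level >= required

print(claim_is_supported(True, ProbeEvidence.TRAIN_SET_SEPARABILITY))   # False
print(claim_is_supported(False, ProbeEvidence.HELD_OUT_GENERALIZATION)) # True
```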

Key Points
  • Surgical ablation removed censorship in 3 of 4 tested models, proving knowledge is preserved but rerouted.
  • Routing mechanisms are lab-specific and fragile; cross-model transfer failed completely (cosine similarity 0.004; see the sketch after this list).
  • Proposes a new 4-level evidence hierarchy for AI safety testing to replace flawed refusal-based benchmarks.
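For intuition on the transfer-failure number, the sketch below computes cosine similarity between two routing directions. With independent random vectors in high dimension, similarity concentrates near zero, which is the regime the reported 0.004 sits in; real directions would come from probes fit separately on each lab's model.

```python
# Cosine similarity between two routing directions. Random
# high-dimensional vectors stand in for the real per-lab
# directions, which would be learned from each model.
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(42)
dir_lab_a = rng.normal(size=4096)  # stand-in for lab A's routing direction
dir_lab_b = rng.normal(size=4096)  # stand-in for lab B's routing direction

# Independent 4096-dim vectors give |cosine| on the order of 0.01,
# i.e. essentially orthogonal, matching the paper's transfer failure.
print(f"cosine similarity: {cosine(dir_lab_a, dir_lab_b):.4f}")
```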

Why It Matters

If refusal benchmarks only measure detection, audits can certify a model as safe while the fragile routing that actually enforces its behavior goes untested. That exposes a critical flaw in how we audit AI safety and pushes evaluation methodology for all aligned models toward causal, intervention-based evidence.