ArXiv paper: Expert-aware steering bypasses safety in MoE LLMs
Steering the output of a single expert is enough to suppress refusal in Mixture-of-Experts models.
A new arXiv paper (under review for COLM 2026) titled "Expert-Aware Refusal Steering" explores how to suppress safety refusal behavior in Mixture-of-Experts (MoE) large language models. The authors extend previous refusal steering techniques, which apply a steering vector during inference, to three open-source MoE LLMs. They find that the complex routing patterns in MoE architectures do not inhibit steering performance and, importantly, that refusal behavior can be effectively suppressed by manipulating the output of a single expert in the model.
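To make the single-expert idea concrete, here is a minimal sketch on a toy MoE layer. It assumes the steering vector is a difference of mean activations between harmful and harmless prompts, and that steering means projecting that direction out of one expert's output. Both are common choices in earlier refusal steering work, not details confirmed by this summary. The names `refusal_direction`, `ablate`, and `moe_layer`, along with the random weights, are hypothetical.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Unit-norm difference of mean activations (an assumed way to build the vector)."""
    d = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def ablate(x, direction):
    """Remove the component of x that lies along a unit-norm direction."""
    return x - np.dot(x, direction) * direction

def moe_layer(x, router_w, expert_ws, top_k=2, steer_expert=None, direction=None):
    """Toy top-k MoE layer for one token; optionally steers a single expert's output."""
    logits = router_w @ x                    # one routing score per expert
    top = np.argsort(logits)[-top_k:]        # indices of the top-k experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                     # softmax over the selected experts
    out = np.zeros_like(x)
    for g, e in zip(gates, top):
        y = expert_ws[e] @ x                 # toy linear "expert"
        if steer_expert is not None and e == steer_expert:
            y = ablate(y, direction)         # intervene on this expert only
        out += g * y
    return out

# Hypothetical usage with random data standing in for real activations.
rng = np.random.default_rng(0)
dim, n_experts = 8, 4
direction = refusal_direction(rng.normal(size=(16, dim)) + 1.0,
                              rng.normal(size=(16, dim)))
x = rng.normal(size=dim)
router_w = rng.normal(size=(n_experts, dim))
expert_ws = rng.normal(size=(n_experts, dim, dim))
steered = moe_layer(x, router_w, expert_ws, steer_expert=0, direction=direction)
```

In this sketch the intervention only has an effect when the router actually selects the steered expert for the token, so the choice of which expert to target matters.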
The researchers propose two expert-aware methods: one that identifies refusal-specific expert routing patterns and another that uses expert-specific steering directions. Their experiments reveal that the signals captured by steering vectors differ from the routing behavior of experts, suggesting that attention mechanisms, rather than the experts themselves, play a substantial role in MoE refusal behavior. This work highlights potential vulnerabilities in safety alignment for MoE models and points to attack vectors that target specific experts to bypass guardrails.
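One way to picture the first method, identifying refusal-specific routing patterns, is to compare how often each expert is selected on harmful versus harmless prompts. The sketch below is an assumption about what such an analysis could look like, not the authors' procedure. The function names and the tiny routing logs are hypothetical.

```python
import numpy as np

def expert_usage(selected, n_experts):
    """Fraction of tokens routed to each expert; `selected` has shape (tokens, top_k)."""
    counts = np.bincount(np.asarray(selected).ravel(), minlength=n_experts)
    return counts / len(selected)

def refusal_skewed_experts(harmful_sel, harmless_sel, n_experts):
    """Order experts by how much more often harmful prompts are routed to them."""
    gap = expert_usage(harmful_sel, n_experts) - expert_usage(harmless_sel, n_experts)
    order = np.argsort(-gap)  # largest positive gap first
    return order, gap

# Hypothetical routing logs: the top-2 expert indices chosen for three tokens each.
harmful_sel = [[0, 2], [0, 3], [0, 1]]
harmless_sel = [[1, 2], [1, 3], [0, 1]]
order, gap = refusal_skewed_experts(harmful_sel, harmless_sel, n_experts=4)
```

Here `order[0]` is the expert most over-represented on the harmful prompts. Per the paper's reported finding, such routing statistics need not line up with what the steering vectors capture.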
- Refusal steering works on MoE LLMs without being hindered by complex expert routing.
- Suppressing refusal can be done by targeting the output of a single expert.
- The signals captured by steering vectors differ from expert routing behavior, implicating attention mechanisms in MoE refusal behavior.
Why It Matters
The paper reveals a single-expert attack vector for bypassing safety in Mixture-of-Experts models, challenging current alignment assumptions.