Experiments on Refusal Shape in LLMs
New experiments reveal AI safety controls are more complex than previously thought, with distinct refusal mechanisms for different harm types.
ValueShift Research has published experimental results challenging a foundational assumption about how large language models refuse harmful requests. The team reproduced the methodology of the original 'Refusal in Language Models Is Mediated by a Single Direction' paper, then expanded testing across domains. Using Llama-2-7b-chat with 280+ carefully crafted prompts spanning safety-related and policy-related categories, they extracted refusal vectors and computed cosine similarity matrices between them. The results showed three distinct refusal directions rather than a single universal one, contradicting the original paper's central claim.
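A minimal sketch of what this analysis might look like, assuming the original paper's difference-in-means recipe: a per-domain refusal direction is the mean harmful activation minus the mean benign activation, and directions are then compared pairwise by cosine similarity. The placeholder activations, array shapes, and two named domains below are illustrative, not the team's actual data.

```python
# Sketch of difference-in-means refusal-direction extraction and cosine
# comparison. Assumes per-prompt residual-stream activations have already
# been collected (e.g. via hooks on Llama-2-7b-chat); the random arrays
# here stand in for real activations.
import numpy as np

def refusal_direction(harmful_acts: np.ndarray, benign_acts: np.ndarray) -> np.ndarray:
    """Mean harmful activation minus mean benign activation, unit-normalized."""
    direction = harmful_acts.mean(axis=0) - benign_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

rng = np.random.default_rng(0)
d_model = 4096  # Llama-2-7b hidden size
# Hypothetical (n_prompts, d_model) activations per category, matching the
# category sizes reported in the write-up.
acts = {
    ("safety", "harmful"): rng.normal(size=(100, d_model)),
    ("safety", "benign"): rng.normal(size=(80, d_model)),
    ("policy", "harmful"): rng.normal(size=(100, d_model)),
    ("policy", "benign"): rng.normal(size=(100, d_model)),
}

domains = ["safety", "policy"]  # extend with further harm domains as tested
directions = np.stack([
    refusal_direction(acts[(d, "harmful")], acts[(d, "benign")])
    for d in domains
])

# Cosine similarity matrix between domain-specific refusal directions;
# off-diagonal values near 1.0 would support a single shared direction,
# lower values would indicate distinct per-domain directions.
cos_sim = directions @ directions.T
print(cos_sim)
```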
The research team created specific prompt categories: 80 benign safety instructions, 100 harmful safety instructions, 100 benign policy prompts, and specialized harmful policy instructions designed to trigger refusal mechanisms. They used the same Llama-2-7b-chat model as the original study, running a quantized version on Colab A100 GPUs with consistent prompt formatting across all categories. The findings suggest that refusal mechanisms are domain-specific rather than generalized, with different neural pathways activating for safety violations than for policy violations. This has significant implications for how researchers approach AI safety interventions and model alignment techniques.
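For concreteness, here is one way the consistent formatting and activation capture could be set up, assuming the Llama-2 chat template, 4-bit quantization via bitsandbytes, and a mid-layer last-token readout. The layer index and pooling choice are assumptions, since the write-up does not specify them.

```python
# Hedged sketch: load a quantized Llama-2-7b-chat and capture hidden-state
# activations with identical prompt formatting for every prompt category.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # assumed scheme
    device_map="auto",
)

def format_prompt(instruction: str) -> str:
    # Llama-2 chat format, applied identically to every category.
    return f"[INST] {instruction} [/INST]"

@torch.no_grad()
def last_token_activation(instruction: str, layer: int = 14) -> torch.Tensor:
    """Hidden state at the final prompt token of an (assumed) middle layer."""
    inputs = tokenizer(format_prompt(instruction), return_tensors="pt").to(model.device)
    out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer][0, -1].float().cpu()
```

Activations collected this way per category would feed directly into the difference-in-means comparison sketched above.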
- Found 3 distinct refusal vectors in Llama-2-7b-chat across different harm domains, not a single universal direction
- Tested with 280+ prompts across safety and policy categories using reproducible Colab notebook methodology
- Results challenge the 'single direction' hypothesis and suggest more complex safety architectures in LLMs
Why It Matters
Understanding refusal mechanisms is crucial for developing AI safety controls and alignment techniques that remain effective across different harm types.