Qwen-Scope: Official Sparse Autoencoders (SAEs) for Qwen 3.5 models
You can now 'mute' unwanted concepts, such as refusals or unintended Chinese-language mixing, in Qwen 3.5 models.
The Qwen team has open-sourced Qwen-Scope, a suite of Sparse Autoencoders (SAEs) covering their entire Qwen 3.5 model family, from the 2B-parameter version up to the 35B Mixture-of-Experts model. Think of an SAE as a decoder that translates the model's high-dimensional internal activations into human-interpretable features: each feature corresponds to a distinct concept, such as 'legal discourse', 'Python code syntax', or even 'refusal behavior'. The team has mapped these features for the residual stream at every layer, giving researchers and developers an unprecedented window into the model's neural activity.
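The encode/decode mechanics can be sketched in a few lines. This is a minimal illustration only: the dimensions and weights below are random stand-ins (Qwen-Scope ships trained weights, and the real residual stream is far wider), and the function names are hypothetical, not the released API.

```python
import numpy as np

rng = np.random.default_rng(0)

D_MODEL = 64      # hypothetical residual-stream width (real models are much wider)
N_FEATURES = 512  # hypothetical SAE dictionary size

# Random stand-ins for pretrained SAE weights.
W_enc = rng.normal(scale=0.02, size=(D_MODEL, N_FEATURES))
b_enc = np.zeros(N_FEATURES)
W_dec = rng.normal(scale=0.02, size=(N_FEATURES, D_MODEL))
b_dec = np.zeros(D_MODEL)

def sae_encode(resid: np.ndarray) -> np.ndarray:
    """Map one token's residual-stream vector to non-negative feature activations."""
    return np.maximum(resid @ W_enc + b_enc, 0.0)  # ReLU keeps activations sparse

def sae_decode(feats: np.ndarray) -> np.ndarray:
    """Reconstruct the residual-stream vector from feature activations."""
    return feats @ W_dec + b_dec

resid = rng.normal(size=D_MODEL)       # activation for a single token
feats = sae_encode(resid)
recon = sae_decode(feats)
top_ids = np.argsort(feats)[::-1][:5]  # most active feature IDs for this token
```

The `top_ids` lookup is the 'concept dictionary' step: those indices are what map to labels like 'Feature #6159 (Chinese language)'.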
What can you actually do with this? The most compelling application is 'surgical abliteration': instead of the blunt 'mean difference' method for removing safety filters, which often degrades reasoning, you can locate the exact feature ID responsible for moralizing or refusals and suppress only that dimension. Beyond that, feature steering lets you force-activate specific concepts during generation (e.g., injecting a 'Classical Literary Style' direction), and model debugging helps explain why a model suddenly switches languages, as in the demo where Feature #6159 (Chinese language) over-activated. Qwen's usage guidelines forbid removing safety filters despite the Apache 2.0 license, but the technical capability is there for fine-tuning diagnostics and controlled generation.
- Maps internal features across all layers for Qwen 3.5 models (2B to 35B MoE), creating a 'concept dictionary' with IDs like 'Feature #6159' (Chinese language).
- Enables surgical abliteration: suppress refusal/moralizing features precisely without the reasoning loss of traditional methods.
- Supports feature steering, model debugging (e.g., diagnosing unintended language mixing), and fine-tuning dataset analysis via a Hugging Face Space demo.
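The suppression and steering operations above both reduce to arithmetic on a feature's decoder direction. The sketch below shows the idea under stated assumptions: random stand-in weights, a made-up `FEATURE_ID`, and hypothetical helper names; the released tooling may expose this differently.

```python
import numpy as np

rng = np.random.default_rng(1)
D_MODEL, N_FEATURES = 64, 512  # hypothetical sizes, as before
W_enc = rng.normal(scale=0.02, size=(D_MODEL, N_FEATURES))
W_dec = rng.normal(scale=0.02, size=(N_FEATURES, D_MODEL))

FEATURE_ID = 137  # hypothetical stand-in for a real ID like #6159

def suppress_feature(resid: np.ndarray, feature_id: int, strength: float = 1.0) -> np.ndarray:
    """Surgical suppression: subtract only this feature's decoder direction,
    scaled by how strongly it fired, leaving other directions untouched."""
    act = max(float(resid @ W_enc[:, feature_id]), 0.0)  # this feature's activation
    return resid - strength * act * W_dec[feature_id]

def steer_feature(resid: np.ndarray, feature_id: int, strength: float = 4.0) -> np.ndarray:
    """Feature steering: force-activate a concept by adding its direction."""
    return resid + strength * W_dec[feature_id]

resid = rng.normal(size=D_MODEL)  # one token's residual-stream vector
muted = suppress_feature(resid, FEATURE_ID)
boosted = steer_feature(resid, FEATURE_ID)
```

In practice this edit would be applied inside a forward hook at the chosen layer's residual stream for every generated token; the contrast with mean-difference ablation is that only one dictionary direction is touched rather than an averaged direction that entangles many concepts.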
Why It Matters
Developers get direct control over model behavior—turning dials on concepts instead of fighting with prompts.