Abliterlitics: Benchmarks and Tensor Comparison for Heretic, Abliterix, Huihui, and HauhauCS on GLM-4.7-Flash
Four techniques achieve 100% attack success rate on a 59B MoE model
A benchmark of abliteration techniques on GLM-4.7-Flash, a 59B-parameter Mixture of Experts (MoE) model with 64 routed experts per layer, finds that all four tested methods reach a 100% attack success rate (ASR) on the HarmBench safety evaluation. The base model refuses 57.8% of harmful prompts; after Heretic, HauhauCS, Huihui, or Abliterix is applied, it complies with every test case. The result is a notable milestone for abliteration research on MoE architectures, whose expert routing makes them behave differently from dense transformers.
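For reference, ASR can be read as the share of harmful prompts for which a judge marks the response as compliant. The following is a minimal sketch under that assumption, with one boolean verdict per prompt; the function name and the toy verdicts are illustrative and are not taken from the benchmark.

```python
def attack_success_rate(verdicts):
    """Return the percentage of prompts judged as successful attacks.

    `verdicts` holds one boolean per prompt: True if the model complied.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    return 100.0 * sum(verdicts) / len(verdicts)


# Toy verdicts (hypothetical, not benchmark data).
toy_verdicts = [True, False, True, True]
toy_asr = attack_success_rate(toy_verdicts)
print(f"ASR: {toy_asr:.1f}%")
```

Under this definition, a 100% ASR means every verdict in the list is True.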
The capability benchmarks show minimal degradation across most tasks. Heretic, the surgical rank-1 edit approach by p-e-w, retains the highest MMLU score (69.00 vs. 68.93 for the base model) and raises GSM8K from 93.45 to 93.75. HauhauCS, which was recently exposed as a plagiarized fork of Heretic with all attribution stripped, scores slightly lower on GSM8K (92.57 vs. 93.45) and TruthfulQA (48.14 vs. 50.86). Weight analysis reveals that HauhauCS stacked additional third-party techniques on top of Heretic's core, introducing extra edit vectors that degrade certain capabilities. The findings underscore that while abliteration can effectively remove safety constraints, the choice of technique and the quality of its implementation directly affect model performance.
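To illustrate why a tensor comparison can separate a single rank-1 edit from stacked edits, here is a minimal NumPy sketch. It assumes a generic directional ablation of the form W' = W - r rᵀ W applied to a random matrix, and it is not the actual implementation of Heretic or HauhauCS. The function names, matrix sizes, and tolerance are hypothetical choices for the example.

```python
import numpy as np


def ablate_direction(weight, direction):
    """Project `direction` out of the output space of `weight`.

    Computes W' = W - r r^T W with r normalized, so W' - W has rank 1.
    """
    r = direction / np.linalg.norm(direction)
    return weight - np.outer(r, r) @ weight


def effective_rank(delta, rel_tol=1e-6):
    """Count singular values of `delta` above rel_tol times the largest."""
    s = np.linalg.svd(delta, compute_uv=False)
    if s[0] == 0:
        return 0
    return int(np.sum(s > rel_tol * s[0]))


# Toy stand-ins for a base weight matrix and two edit directions.
rng = np.random.default_rng(0)
base = rng.standard_normal((64, 48))
r1 = rng.standard_normal(64)
r2 = rng.standard_normal(64)

single = ablate_direction(base, r1)     # one rank-1 edit
stacked = ablate_direction(single, r2)  # a second edit on top

rank_single = effective_rank(single - base)
rank_stacked = effective_rank(stacked - base)
print("rank of single-edit delta:", rank_single)
print("rank of stacked-edit delta:", rank_stacked)
```

In this toy setting, the difference between the edited and base matrices has one dominant singular value for a single edit, and each independent extra direction adds another. Real checkpoints are noisier, so the tolerance would need tuning per tensor.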
- All four abliteration techniques achieve 100% ASR on HarmBench, bypassing all safety refusals on GLM-4.7-Flash
- Heretic retains the highest capability scores (MMLU 69.00, GSM8K 93.75) with minimal degradation
- HauhauCS exposed as a plagiarized Heretic fork; its additional stacked methods cause slight performance drops
Why It Matters
Abliteration research on MoE models shows that safety refusals can be bypassed with minimal capability loss, raising serious security concerns.