Open Source

Abliteration Benchmarks on GLM-4.7-Flash Reveal 100% Safety Bypass

r/LocalLLaMA April 28, 2026

⚡ Four techniques achieve 100% attack success rate on 59B MoE model

Deep Dive

A benchmark of abliteration techniques on GLM-4.7-Flash, a 59B-parameter Mixture of Experts (MoE) model with 64 routed experts per layer, finds that all four tested methods reach a 100% attack success rate (ASR) on the HarmBench safety evaluation. The base model refuses 57.8% of harmful prompts; after Heretic, HauhauCS, Huihui, or Abliterix is applied, it complies with every test case. The result is notable for abliteration research on MoE architectures, where expert routing makes the models behave differently from dense transformers.
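To put the refusal figure on the same scale as the ASR figure, the arithmetic can be sketched as follows. This is only an illustration under a stated assumption: it treats every non-refused prompt as a successful attack, which the source does not confirm, so the base-model number is an upper bound rather than a reported result. The function name `asr_upper_bound` is hypothetical.

```python
def asr_upper_bound(refusal_rate_pct: float) -> float:
    """Upper bound on ASR (%), ASSUMING every non-refusal counts as a success.

    This assumption is for illustration only; the source reports a refusal
    rate for the base model, not its ASR.
    """
    if not 0.0 <= refusal_rate_pct <= 100.0:
        raise ValueError("refusal rate must be a percentage in [0, 100]")
    return round(100.0 - refusal_rate_pct, 2)


# Base model: refuses 57.8% of harmful prompts (figure from the article).
base_asr_bound = asr_upper_bound(57.8)

# Abliterated variants: the article reports compliance on every test case,
# i.e. a refusal rate of 0%.
abliterated_asr = asr_upper_bound(0.0)

print(f"Base model ASR (upper bound): {base_asr_bound}%")
print(f"Abliterated ASR: {abliterated_asr}%")
```

Under that assumption, the base model's ASR is at most 42.2%, so abliteration moves it by at least 57.8 percentage points.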

Capability benchmarks show minimal degradation on most tasks. Heretic, the surgical rank-1 edit approach by p-e-w, retains the highest MMLU score (69.00 vs. 68.93 for the base model) and nudges GSM8K up from 93.45 to 93.75. HauhauCS, recently exposed as a plagiarized fork of Heretic with all attribution stripped, scores slightly lower on GSM8K (92.57 vs. 93.45) and TruthfulQA (48.14 vs. 50.86). Weight analysis shows that HauhauCS stacks additional third-party techniques on top of Heretic's core, and the extra edit vectors degrade certain capabilities. While abliteration can effectively remove safety constraints, the choice of technique and the quality of its implementation directly affect model performance.
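The size of these shifts is easier to judge as differences from the base model. The sketch below computes them using only the scores quoted in this article; benchmarks not quoted for a given variant are simply omitted, and the helper name `score_deltas` is hypothetical.

```python
# Scores quoted in the article (percent). Missing entries were not reported here.
BASE = {"MMLU": 68.93, "GSM8K": 93.45, "TruthfulQA": 50.86}
VARIANTS = {
    "Heretic": {"MMLU": 69.00, "GSM8K": 93.75},
    "HauhauCS": {"GSM8K": 92.57, "TruthfulQA": 48.14},
}


def score_deltas(base: dict, scores: dict) -> dict:
    """Return variant-minus-base differences, in percentage points."""
    return {task: round(value - base[task], 2) for task, value in scores.items()}


deltas = {name: score_deltas(BASE, scores) for name, scores in VARIANTS.items()}

for name, per_task in deltas.items():
    for task, delta in per_task.items():
        print(f"{name:9s} {task:11s} {delta:+.2f} pts")
```

On these figures, Heretic stays within about a third of a point of the base model on both quoted tasks, while HauhauCS loses 0.88 points on GSM8K and 2.72 points on TruthfulQA.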

Key Points

All four abliteration techniques achieve 100% ASR on HarmBench, bypassing all safety refusals on GLM-4.7-Flash
Heretic retains highest capability scores (MMLU 69.00, GSM8K 93.75) with minimal degradation
HauhauCS exposed as plagiarized Heretic fork with additional stacked methods causing slight performance drops

Why It Matters

Abliteration research on MoE models shows safety can be bypassed with minimal capability loss, raising critical security concerns.
