Open Source

Apostate abliteration tool nearly matches Heretic but falls short on the hardest safety bypasses

The new abliteration tool bypasses refusals on 98.8% of HarmBench requests to Qwen 2.5 7B, just behind Heretic's 100%.

Deep Dive

Heterodoxin’s new abliteration tool, Apostate, joins the established tools Heretic and Huihui in targeting the safety training of Qwen 2.5 7B. In a detailed benchmark, Apostate achieved a 98.8% attack success rate (ASR) on HarmBench's 400 harmful behaviors, meaning the modified model still refused only 5 of the 400 requests. Heretic scored a perfect 100%, while Huihui reached 98.2%. That puts Apostate in a strong second place, though it still struggled with the hardest categories, harassment and harmful content.
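The arithmetic behind the ASR figure can be checked with a minimal sketch. The function name `attack_success_rate` is hypothetical and is not taken from Apostate or HarmBench; the sketch assumes ASR is simply the share of requests that were not refused.

```python
def attack_success_rate(total: int, refused: int) -> float:
    """Return the percentage of requests that were NOT refused.

    Assumption: ASR = (total - refused) / total, expressed as a percentage.
    """
    if total <= 0 or not 0 <= refused <= total:
        raise ValueError("need total > 0 and 0 <= refused <= total")
    return 100.0 * (total - refused) / total


# Figures from the article: 400 HarmBench behaviors, 5 still refused.
asr = attack_success_rate(total=400, refused=5)
print(f"{asr:.2f}%")  # 98.75%, reported as 98.8% after rounding
```

Under this definition, 395 answered requests out of 400 give 98.75%, which is consistent with the reported 98.8%.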

Strikingly, Apostate and Huihui found almost entirely different refusal directions in the model's weights: the cosine similarity between them is just 0.023, which means the two directions are nearly orthogonal. This suggests that Qwen 2.5 7B’s safety behavior is not a single switch but can be disabled along multiple, largely independent paths. Apostate modifies 55 tensors (16.2% of the model's tensors), compared to Heretic’s 37 (10.9%). On standard benchmarks (MMLU, GSM8K, HellaSwag, etc.), all three tools cause minimal degradation, and GSM8K math reasoning scores even improve slightly. Heretic’s author, p-e-w, has confirmed that Apostate's code is original work, dispelling plagiarism concerns.
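To give a sense of what a cosine similarity of 0.023 means, here is a minimal sketch that computes cosine similarity for two vectors and converts a similarity value into an angle. The helper names and the toy vectors are illustrative assumptions; they are not the actual refusal directions or code from either tool.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_u * norm_v)


def angle_degrees(cos_sim: float) -> float:
    """Angle in degrees for a given cosine similarity (clamped to [-1, 1])."""
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_sim))))


# Toy vectors (hypothetical): identical directions vs. perpendicular ones.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0

# The similarity reported in the article, expressed as an angle.
print(round(angle_degrees(0.023), 1))  # 88.7
```

A similarity of 0.023 corresponds to an angle of about 88.7 degrees, so the two reported directions are close to perpendicular rather than small variations of one another.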

Key Points
  • Apostate achieves 98.8% ASR on HarmBench, leaving 5 of 400 requests refused, versus Heretic's perfect 100%.
  • Apostate and Huihui find almost entirely different refusal directions (cosine similarity 0.023), which points to multiple independent safety paths.
  • All three tools largely preserve model capability; GSM8K scores improved slightly under each abliteration.

Why It Matters

The results suggest that a model's safety guardrails can be disabled along multiple independent directions, which makes robust alignment even harder to achieve.