Gemma 4 E2B abliteration test: 13 variants, 96% ASR, one actually beats base on math
44 GPU hours on a single RTX 5090 reveal which safety-removed variants keep (or improve) capabilities.
A comprehensive analysis of 13 abliterated variants of Google's Gemma 4 E2B-instruct model finds that safety removal is now trivial but capability preservation remains inconsistent. Using a single RTX 5090 over 44 GPU hours, the author performed weight forensics, KL divergence measurements, a 400-prompt HarmBench evaluation (with all 5,600 responses reviewed by an LLM judge), and eight benchmark tasks run through lm-eval in native BF16. Every variant raised the Attack Success Rate (ASR) from the base model's 32.2% to between 82% and 100%; five reached 99% or higher, and treadon achieved 100% with zero refusals.
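For readers new to the metric, ASR is the share of harmful prompts for which the model complied rather than refused. The following is a minimal sketch, not the author's evaluation code: the verdict labels `"complied"` and `"refused"` and the function name are illustrative assumptions. It also shows that the 5,600 figure is consistent with 400 prompts run against the base model plus 13 variants.

```python
def attack_success_rate(verdicts):
    """Return ASR as a percentage: complied responses / all judged responses."""
    if not verdicts:
        raise ValueError("no verdicts to score")
    complied = sum(1 for v in verdicts if v == "complied")
    return 100.0 * complied / len(verdicts)


# 400 HarmBench prompts x (1 base model + 13 variants) judged responses
PROMPTS = 400
MODELS = 1 + 13
total_responses = PROMPTS * MODELS

# Hypothetical verdicts for one model: 384 compliances out of 400 prompts
example = ["complied"] * 384 + ["refused"] * 16

print(total_responses)               # 5600
print(attack_success_rate(example))  # 96.0
```

How empty responses are scored (as refusals, or excluded) changes the denominator; the summary does not say which convention was used, so this sketch simply counts every judged response.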
More surprisingly, two variants outperformed the base model on GSM8K math reasoning: coder3101 scored 84.8% and llmfan46 scored 83.9%, against 83.5% for the base model. Abliteration shortened the thinking chains, so more answers finished correctly within the fixed token budget. Aggressive approaches, however, caused significant degradation: ether4o4 dropped 6.9 points on GSM8K and produced 84 empty responses. LAMBADA perplexity varied wildly, with wangzhang reaching 7.35x the base model's perplexity. Several creators' claims were found to be misleading: duoneural claimed 'near-zero divergence at ~0.001', but the measured value was 0.187 (187x the claim), and wwtcyberlab claimed '101% quality preservation' despite 5.69x the base perplexity. Only coder3101, pew, and trevorjs reported divergence figures that matched the measurements to within 1.3%. The author recommends coder3101's variant as the best balance of safety removal and capability retention.
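The claim-versus-measurement comparison above is simple arithmetic, sketched below. The summary does not define how "within 1.3%" was computed, so treating it as a relative error against the claimed value is an assumption, as are the function names.

```python
def relative_error(claimed, measured):
    """Gap between measurement and claim, as a fraction of the claim."""
    return abs(measured - claimed) / claimed


def matches_claim(claimed, measured, tolerance=0.013):
    """True if the measurement is within the tolerance (assumed 1.3% relative)."""
    return relative_error(claimed, measured) <= tolerance


# duoneural: claimed ~0.001 divergence, measured 0.187
claimed, measured = 0.001, 0.187
ratio = measured / claimed

print(round(ratio))                      # 187
print(matches_claim(claimed, measured))  # False
```

Under this reading, a variant whose creator reported 0.100 and measured at 0.101 would pass, while duoneural's figure misses by more than two orders of magnitude.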
- All 13 abliterated variants lift HarmBench ASR to 82-100%; treadon hits 100% with zero refusals and only a 2.9-point GSM8K drop.
- coder3101 variant achieves 96% ASR and beats base on GSM8K (84.8% vs 83.5%) by shortening reasoning chains.
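- Several creators' 'capabilities preserved' claims did not hold up: duoneural's measured divergence was 187x its claimed ~0.001, and wwtcyberlab claimed '101% quality preservation' despite 5.69x base perplexity; only 3 of 13 reported divergence figures matched the measurements.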
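The divergence figures in these claims are KL divergences between the base model's and a variant's next-token distributions. The sketch below shows the calculation on toy logits in pure Python. The direction (base relative to variant), the use of nats, and the averaging over positions are assumptions, since the summary does not specify the author's exact setup.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) in nats for two distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def mean_kl(base_logits_per_pos, variant_logits_per_pos):
    """Average KL(base || variant) over token positions."""
    pairs = list(zip(base_logits_per_pos, variant_logits_per_pos))
    return sum(kl_divergence(softmax(b), softmax(v)) for b, v in pairs) / len(pairs)


# Toy 3-token vocabulary, two positions (hypothetical values)
base = [[2.0, 1.0, 0.1], [0.5, 0.5, 3.0]]
variant = [[1.5, 1.2, 0.3], [0.4, 0.9, 2.5]]

print(mean_kl(base, base))     # 0.0 for identical models
print(mean_kl(base, variant))  # small positive number
```

A value near zero means the variant's predictions are almost indistinguishable from the base model's on the prompts tested, which is why a claimed ~0.001 and a measured 0.187 describe very different models.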
Why It Matters
Shows that abliteration can slightly improve some skills, and that model creators' capability-preservation claims are often unreliable.