Research & Papers

Google's Gemma 3 shows safety regression with 68.7% attack success rate

Gemma 3 is more vulnerable than both its predecessor and its successor, a non-monotonic safety trend.

Deep Dive

A new study from Subhadip Mitra uses quality-diversity evolution (MAP-Elites) as an automated red-teaming probe to test safety alignment across four generations of Google's Gemma family (7B–31B). The results reveal a non-monotonic safety trend: Gemma 3 (12B) exhibits a 68.7% attack success rate (ASR), well above Gemma 2's 45.5% and roughly double Gemma 4's 33.9%. Replaying the evolved attack archives across models shows that attacks evolved against other generations transfer to Gemma 3 at 44–46% but to Gemma 4 at only 14–18%, indicating that Gemma 4's safety gains generalize beyond the attack distributions evolved against earlier models.
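For readers unfamiliar with MAP-Elites, the following is a minimal sketch of the general algorithm on a toy numeric problem, not the study's implementation. The grid size, fitness function, and mutation operator are all hypothetical stand-ins. In a red-teaming setting, a candidate would be an attack prompt, the fitness would come from a safety judge, and the descriptor would place each prompt in a cell such as an attack category or style.

```python
import math
import random

GRID = 5  # cells per descriptor dimension (hypothetical choice)


def random_solution(rng):
    # Toy candidate: a point in the unit square.
    return (rng.random(), rng.random())


def mutate(parent, rng, sigma=0.1):
    # Gaussian perturbation, clamped back into [0, 1].
    return tuple(min(1.0, max(0.0, v + rng.gauss(0.0, sigma))) for v in parent)


def descriptor(sol):
    # Map a candidate to its archive cell (its "behavior" bin).
    return tuple(min(GRID - 1, int(v * GRID)) for v in sol)


def fitness(sol):
    # Toy objective; stands in for a judge's attack-success score.
    x, y = sol
    return math.sin(3 * x) * math.cos(2 * y)


def map_elites(iterations=2000, n_init=50, seed=0):
    rng = random.Random(seed)
    archive = {}  # cell -> (score, solution): the best candidate seen per cell
    for step in range(iterations):
        if step < n_init:
            candidate = random_solution(rng)  # seed the archive randomly
        else:
            _, parent = rng.choice(list(archive.values()))
            candidate = mutate(parent, rng)  # vary an existing elite
        cell = descriptor(candidate)
        score = fitness(candidate)
        # Keep the candidate only if its cell is empty or it beats the elite.
        if cell not in archive or score > archive[cell][0]:
            archive[cell] = (score, candidate)
    return archive


archive = map_elites()
print(f"filled {len(archive)} of {GRID * GRID} cells")
```

The point of the archive is that it keeps one elite per cell rather than a single best solution, which is what makes the result a diverse set of attacks that can later be replayed against other models.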

Misinformation-category ASR jumps from 29% on Gemma 2 to 99% on Gemma 3 and remains elevated at 77% on Gemma 4, a regression that static benchmarks miss entirely. The copyright and cybercrime categories register near-100% ASR across all generations, though a second-judge audit suggests the copyright result is sensitive to the choice of judge. The study argues that longitudinal, adaptive probing is necessary to catch safety regressions in LLM families, because static benchmarks fail to capture these non-monotonic patterns.

Key Points
  • Gemma 3 (12B) has a 68.7% attack success rate, versus 45.5% for Gemma 2 and 33.9% for Gemma 4.
  • Attacks evolved against other generations transfer to Gemma 3 at 44–46% but to Gemma 4 at only 14–18%, indicating that Gemma 4's safety gains generalize beyond attacks evolved against earlier models.
  • Misinformation vulnerability spikes from 29% on Gemma 2 to 99% on Gemma 3 and stays elevated at 77% on Gemma 4.
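The non-monotonic pattern in these figures can be checked mechanically. The sketch below uses only the percentages reported above; the function names are illustrative and not taken from the study. It flags any generation whose ASR rose relative to the one before it.

```python
# Reported ASR (%) per generation, in release order.
overall = [("Gemma 2", 45.5), ("Gemma 3", 68.7), ("Gemma 4", 33.9)]
misinformation = [("Gemma 2", 29.0), ("Gemma 3", 99.0), ("Gemma 4", 77.0)]


def is_monotonic_decreasing(series):
    # True only if ASR never rises from one generation to the next.
    values = [v for _, v in series]
    return all(b <= a for a, b in zip(values, values[1:]))


def regressions(series):
    # List (previous, current, increase in percentage points) wherever ASR rose.
    return [
        (prev, cur, round(cur_v - prev_v, 1))
        for (prev, prev_v), (cur, cur_v) in zip(series, series[1:])
        if cur_v > prev_v
    ]


print(is_monotonic_decreasing(overall))
print(regressions(overall))
print(regressions(misinformation))
```

Both series flag the Gemma 2 to Gemma 3 step as a regression. The misinformation series also shows that Gemma 4, while improved over Gemma 3, is still above the Gemma 2 level.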

Why It Matters

Safety alignment in LLMs can regress from one generation to the next, and static benchmarks can fail to reveal it. Adaptive red-teaming across model generations is needed to catch these regressions.