Research & Papers

When Safety Geometry Collapses: Fine-Tuning Vulnerabilities in Agentic Guard Models

Fine-tuning on harmless data crashed Granite Guardian's refusal rate from 85% to 0%.

Deep Dive

A new paper accepted for the AAAI 2026 Summer Symposium reveals a critical vulnerability in guard models used to protect agentic AI pipelines. The team—from multiple universities—tested three purpose-built safety classifiers: Meta's LlamaGuard, Allen AI's WildGuard, and IBM's Granite Guardian. They found that standard domain specialization on completely benign data causes a total collapse of safety alignment. Granite Guardian's refusal rate dropped from 85% to 0%, CKA similarity fell to zero, and every single output became ambiguous—a far more severe degradation than seen in general-purpose LLMs.
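For context on that metric, CKA (centered kernel alignment) measures how similar two sets of layer activations are, up to a linear transform: values near 1.0 mean the representation geometry survived fine-tuning, values near 0.0 mean it drifted completely. A minimal linear-CKA sketch in PyTorch, assuming activations are collected on the same prompts before and after fine-tuning (the tensor names and shapes below are illustrative, not the paper's setup):

```python
import torch

def linear_cka(x: torch.Tensor, y: torch.Tensor) -> float:
    """Linear CKA between two activation matrices of shape (n_samples, hidden_dim)."""
    # Center each feature dimension before comparing the two representation spaces.
    x = x - x.mean(dim=0, keepdim=True)
    y = y - y.mean(dim=0, keepdim=True)
    # CKA = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    numerator = (y.T @ x).norm(p="fro") ** 2
    denominator = (x.T @ x).norm(p="fro") * (y.T @ y).norm(p="fro")
    return (numerator / denominator).item()

# Hypothetical example: the same 256 prompts run through one layer of a guard
# model before and after benign fine-tuning; values near 0 indicate collapse.
acts_before = torch.randn(256, 1024)
acts_after = torch.randn(256, 1024)
print(f"layer CKA: {linear_cka(acts_before, acts_after):.3f}")
```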

The researchers traced the failure to the destruction of latent safety geometry: the structured boundary between harmful and benign representations that guides classification. Using SVD on per-layer activation differences, they showed that concentrated safety representations are efficient but catastrophically brittle. To counter this, they proposed Fisher-Weighted Safety Subspace Regularization (FW-SSR), a training-time penalty that weights safety-subspace directions by curvature estimated from diagonal Fisher information and scales its strength with an adaptive lambda driven by the conflict between task and safety gradients. FW-SSR recovered a 75% refusal rate on Granite Guardian with a CKA of 0.983 and cut WildGuard's attack success rate to 3.6%, below even the unmodified baseline, by actively sharpening the safety subspace.
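The paper's exact objective is not reproduced in this summary, but the moving parts described above (an SVD-derived safety subspace, diagonal-Fisher direction weights, and a conflict-driven lambda) can be sketched roughly as follows. Everything in this snippet, from the function names to the per-dimension Fisher estimate, is an illustrative assumption rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def safety_directions(harmful_acts: torch.Tensor, benign_acts: torch.Tensor, k: int = 8) -> torch.Tensor:
    """Top-k right singular vectors of paired harmful-minus-benign activation gaps.

    SVD concentrates a layer's safety-relevant variation into a few directions;
    those directions form the subspace the regularizer tries to protect.
    """
    diffs = harmful_acts - benign_acts                      # (n_pairs, hidden_dim)
    _, _, vh = torch.linalg.svd(diffs, full_matrices=False)
    return vh[:k]                                           # (k, hidden_dim)

def fw_ssr_penalty(acts: torch.Tensor, ref_acts: torch.Tensor,
                   directions: torch.Tensor, fisher_diag: torch.Tensor) -> torch.Tensor:
    """Curvature-weighted drift of current activations along the safety subspace."""
    # Weight each direction by how much diagonal Fisher mass it covers, so the
    # directions the loss is most sensitive to are penalized hardest for drifting.
    weights = directions.pow(2) @ fisher_diag               # (k,)
    drift = (acts - ref_acts) @ directions.T                # (batch, k)
    return (weights * drift.pow(2).mean(dim=0)).sum()

def adaptive_lambda(task_grad: torch.Tensor, safety_grad: torch.Tensor, base: float = 1.0) -> torch.Tensor:
    """Grow the penalty when the task update points against the safety penalty."""
    conflict = -F.cosine_similarity(task_grad.flatten(), safety_grad.flatten(), dim=0)
    return base * (1.0 + conflict.clamp(min=0.0))

# Toy shapes: 64 harmful/benign prompt pairs, hidden dim 512, batch of 32.
dirs = safety_directions(torch.randn(64, 512), torch.randn(64, 512), k=8)
fisher = torch.rand(512)                                    # diagonal Fisher estimate per hidden dim
penalty = fw_ssr_penalty(torch.randn(32, 512), torch.randn(32, 512), dirs, fisher)
lam = adaptive_lambda(torch.randn(512), torch.randn(512))
task_loss = torch.tensor(0.42)                              # stand-in for the ordinary fine-tuning loss
total_loss = task_loss + lam * penalty                      # objective minimized during specialization
```

The intuition this sketch captures is that the penalty is anisotropic: drift along Fisher-heavy safety directions is expensive, while drift orthogonal to the subspace is free, which is what would let a guard model specialize on benign data without flattening its safety geometry.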

Key Points
  • Granite Guardian's refusal rate dropped from 85% to 0% after fine-tuning on benign data only
  • All three guard models (LlamaGuard, WildGuard, Granite Guardian) lost safety alignment via geometry collapse
  • FW-SSR technique recovers 75% refusal on Granite Guardian and reduces WildGuard attack success rate to 3.6%

Why It Matters

As agentic AI grows, fine-tuning safety guards on benign data can silently disable them—FW-SSR offers a fix.