Geometry-Lite reveals how LLMs encode safety across layers with margin geometry
A new probe maps hidden states across 9 models to explain when prompts are flagged as unsafe
A team of researchers has introduced Geometry-Lite, a probing method that shows how large language models build safety signals across their internal layers. Rather than working directly with high-dimensional hidden states, Geometry-Lite maps each layer's final prompt-token representation to a signed margin using three readout strategies: centroid-based, local-neighborhood, and supervised linear-boundary. The approach was tested across nine instruction-tuned LLMs ranging from 1.2B to 70B parameters on seven safety benchmarks. The key insight: safety evidence is carried by persistent boundary-position geometry (final or extremal margins and the share of layers on the unsafe side), not by layer-to-layer motion. Finite-difference drift and structural summaries added little to overall detection accuracy, though drift provided small recall-oriented corrections under shifted low-false-positive-rate thresholds.
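To make the margin idea concrete, here is a minimal sketch of a centroid-based readout and the per-prompt summaries described above. It is an illustration, not the paper's implementation: the function names, the Euclidean distance, and the sign convention (positive means closer to the unsafe centroid) are all assumptions.

```python
import numpy as np

def centroid_margins(train_h, train_y, h):
    """Signed centroid margin per layer for one prompt (hypothetical helper).

    train_h: (n, L, d) final prompt-token hidden states for n labeled prompts
    train_y: (n,) labels, 1 = unsafe, 0 = safe
    h:       (L, d) hidden states of the prompt being scored
    Returns an (L,) array; positive values mean the prompt sits closer to
    the unsafe centroid at that layer (assumed sign convention).
    """
    mu_unsafe = train_h[train_y == 1].mean(axis=0)  # (L, d) unsafe class mean
    mu_safe = train_h[train_y == 0].mean(axis=0)    # (L, d) safe class mean
    d_safe = np.linalg.norm(h - mu_safe, axis=-1)
    d_unsafe = np.linalg.norm(h - mu_unsafe, axis=-1)
    return d_safe - d_unsafe

def summarize(margins):
    """Boundary-position and motion summaries of a margin trajectory.

    Expects at least two layers so that the drift is defined.
    """
    drift = np.diff(margins)  # finite-difference, layer-to-layer motion
    return {
        "final_margin": float(margins[-1]),
        "max_margin": float(margins.max()),
        "min_margin": float(margins.min()),
        "unsafe_occupancy": float((margins > 0).mean()),  # share of layers on unsafe side
        "mean_abs_drift": float(np.abs(drift).mean()),
    }

# Toy data: 3 layers, 2 dimensions; safe prompts at the origin, unsafe at (2, 0).
train_h = np.zeros((4, 3, 2))
train_h[2:, :, 0] = 2.0
train_y = np.array([0, 0, 1, 1])
prompt_h = np.array([[0.0, 0.0], [2.0, 0.0], [2.0, 0.0]])
print(summarize(centroid_margins(train_h, train_y, prompt_h)))
```

In this toy case the prompt starts on the safe side and sits on the unsafe side for the last two layers, so the position features (final margin, occupancy) already capture the outcome without reference to the drift term.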
The findings have direct implications for building safer AI systems. Under benchmark shift, optimized linear boundaries were sharp on the data they were trained on, but class-conditional mean geometry maintained separation more reliably on hard held-out subsets. This suggests that safety probes can be made more robust by focusing on persistent margin geometry rather than chasing dynamic signals. For practitioners, Geometry-Lite offers a compact, interpretable instrument that matches or exceeds single-layer probes while remaining close to raw multi-layer score stacking. The paper is available on arXiv under ID 2605.20241, with code expected to follow. For production safety monitoring, the practical takeaway is to watch where a prompt sits relative to the safety boundary across layers, not how fast it moves.
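In a monitoring setting, a margin score is usually turned into a decision by calibrating a threshold on benign prompts for a target false-positive rate. The sketch below shows one simple way to do that and how the realized rate can drift when the score distribution shifts. The helper names and the "flag if score > threshold" rule are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def threshold_at_fpr(safe_scores, target_fpr):
    """Pick a threshold so that at most target_fpr of safe scores exceed it.

    Assumes a prompt is flagged when score > threshold and 0 <= target_fpr < 1.
    """
    s = np.sort(np.asarray(safe_scores, dtype=float))
    n = s.size
    k = int(np.floor(target_fpr * n))  # number of safe prompts allowed to be flagged
    return s[n - k - 1]                # at most k scores lie strictly above this value

def flag_rate(scores, threshold):
    """Fraction of scores that would be flagged at the given threshold."""
    return float(np.mean(np.asarray(scores, dtype=float) > threshold))

# Calibrate on toy benign scores, then re-check on a shifted benign set.
calib_safe = np.arange(10, dtype=float)  # scores 0..9
t = threshold_at_fpr(calib_safe, target_fpr=0.2)
shifted_safe = calib_safe + 1.0          # same prompts, scores shifted upward
print(t, flag_rate(calib_safe, t), flag_rate(shifted_safe, t))
```

The shifted set is flagged more often than the calibration set at the same threshold, which is why separation that survives benchmark shift matters for low-false-positive operation.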
- Geometry-Lite tested on 9 instruction-tuned LLMs (1.2B–70B) across 7 safety benchmarks, matching or exceeding single-layer probes
- Key finding: safety evidence is persistent boundary-position geometry, not layer-to-layer motion signals
- Under benchmark shift, class-conditional mean geometry retained separation better than optimized linear boundaries
Why It Matters
Provides an interpretable lens into LLM safety decisions, helping engineers build more reliable, low-false-positive content filters.