New geometric detection system catches multi-turn LLM deception with 89% recall
Multi-turn probing attacks leave a stable geometric footprint detectable by a lightweight classifier.
A new paper from researchers Surender Suresh Kumar and Mary L. Cummings tackles a critical blind spot in LLM safety: multi-turn deception. Most safety defenses are trained on single-turn prompts, but real-world attacks often unfold as indirect, multi-turn probing sequences. The authors introduce a unified pipeline that first generates realistic multi-turn deceptive question sets using multi-objective genetic prompt optimization with co-evolving mutation operators. A human study confirmed the dataset's quality and revealed that early generations produced the most convincing deception, and that practical factors such as adherence filtering and ordering effects matter.
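To make the "multi-objective" part concrete, here is a minimal sketch of Pareto (non-dominated) selection, a common building block of multi-objective genetic algorithms. It is not the authors' implementation: the summary does not name the paper's objectives or selection scheme, so the two scores per candidate, the placeholder candidate names, and the `dominates` and `pareto_front` helpers are all hypothetical.

```python
from typing import List, Sequence, Tuple

# Hypothetical candidate: (label, scores). Higher is better on every objective.
Candidate = Tuple[str, Sequence[float]]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if `a` is at least as good as `b` everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other[1], c[1]) for other in candidates)
    ]


# Placeholder population with two assumed objectives per candidate.
population = [
    ("candidate_a", (0.9, 0.2)),
    ("candidate_b", (0.5, 0.5)),
    ("candidate_c", (0.4, 0.4)),  # dominated by candidate_b
    ("candidate_d", (0.2, 0.9)),
]
survivors = pareto_front(population)
```

In a full genetic loop, the survivors would be mutated and rescored each generation. Co-evolving the mutation operators, as the paper describes, would add a second layer that this sketch leaves out.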
The core contribution is a detection method that exploits geometric signatures in embedding space. Using three geometric features (angular coverage, distance ratio, and linearity) alongside pairwise similarity statistics, a lightweight feed-forward classifier achieved consistently high recall (0.89) and test-time F1 scores ranging from 0.74 to 0.86 across base, reworded, and truncated three-turn scenarios. This supports the hypothesis that multi-turn deceptive intent leaves a stable geometric footprint, enabling transparent, low-cost screening without expensive end-to-end training. The authors also discuss responsible uses, limitations, and the need for larger human-evaluated datasets.
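As a rough illustration of what such features might look like, the sketch below computes them from one embedding per turn. The summary does not give the paper's exact definitions, so every formula here is an assumption: angular coverage is taken as the largest pairwise angle between turn embeddings, distance ratio as end-to-end distance divided by path length, and linearity as the mean cosine between consecutive displacement vectors. The function name `geometric_features` is likewise hypothetical.

```python
import math
from itertools import combinations
from statistics import mean


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _norm(a):
    return math.sqrt(_dot(a, a))


def _sub(a, b):
    return [x - y for x, y in zip(a, b)]


def _cos(a, b):
    """Cosine similarity, clipped to [-1, 1]; 0.0 if either vector is zero."""
    denom = _norm(a) * _norm(b)
    if denom == 0:
        return 0.0
    return max(-1.0, min(1.0, _dot(a, b) / denom))


def geometric_features(turns):
    """turns: list of embedding vectors, one per question turn (at least 3)."""
    if len(turns) < 3:
        raise ValueError("need at least three turn embeddings")

    # Pairwise cosine similarities between all turns.
    pair_cos = [_cos(a, b) for a, b in combinations(turns, 2)]

    # Assumed definition: widest angle (radians) spanned by any two turns.
    angular_coverage = max(math.acos(c) for c in pair_cos)

    # Assumed definition: straight-line distance over total path length.
    steps = [_sub(b, a) for a, b in zip(turns, turns[1:])]
    path_len = sum(_norm(s) for s in steps)
    distance_ratio = _norm(_sub(turns[-1], turns[0])) / path_len if path_len else 0.0

    # Assumed definition: 1.0 means consecutive steps point the same way.
    linearity = mean(_cos(s, t) for s, t in zip(steps, steps[1:]))

    return {
        "angular_coverage": angular_coverage,
        "distance_ratio": distance_ratio,
        "linearity": linearity,
        "sim_mean": mean(pair_cos),
        "sim_min": min(pair_cos),
        "sim_max": max(pair_cos),
    }


# Toy 2-D "embeddings" for a three-turn sequence moving in a straight line.
straight = geometric_features([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
# A sequence that returns to where it started.
looped = geometric_features([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
```

The resulting dictionary is a small fixed-length feature vector that could be fed to a small feed-forward classifier. Real inputs would come from a sentence-embedding model, not the toy 2-D vectors used here.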
- Multi-objective genetic prompt optimization creates realistic multi-turn deceptive sequences, validated by a human study.
- Three geometric features (angular coverage, distance ratio, linearity) plus pairwise similarity detect deception with 89% recall.
- Lightweight classifier achieves F1 scores 0.74-0.86 across varied scenarios, enabling transparent screening without heavy training.
Why It Matters
Enables low-cost, transparent screening for multi-turn LLM deception without expensive end-to-end training.