One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue
Researchers built a turn-level monitor that spots harmful intent spread across multiple innocent-looking prompts.
A growing threat to deployed LLMs is the hidden malicious intent spread across multiple benign-looking dialogue turns, bypassing single-prompt guardrails. Even advanced commercial models with safety alignment remain vulnerable. Researchers from Purdue, Georgia Tech, UChicago, and IBM Research tackled this by proposing TurnGate, a turn-level monitor that identifies the earliest turn at which delivering a response would make the accumulated interaction sufficient for harmful action. This precise intervention avoids prematurely refusing benign exploratory conversations.
To support training and evaluation, the team built the Multi-Turn Intent Dataset (MTID), featuring branching attack rollouts, matched benign hard negatives, and annotations of the earliest harm-enabling turns. TurnGate significantly outperforms existing baselines in harmful-intent detection while keeping over-refusal rates low. It also generalizes across different domains, attacker pipelines, and target models. The code and dataset are publicly available, offering a practical defense against sophisticated multi-turn jailbreaks.
- TurnGate detects the earliest turn where a response would enable malicious action, preventing both harm and premature refusal.
- The MTID dataset includes branching attack rollouts and benign hard negatives to train precise turn-level interventions.
- Outperforms existing baselines in accuracy while maintaining low over-refusal rates, and generalizes across domains and models.
Why It Matters
Protects LLMs from sophisticated multi-turn jailbreaks, reducing safety risks while maintaining conversational utility.