Research & Papers

One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue

Researchers built a turn-level monitor that spots harmful intent spread across multiple innocent-looking prompts.

Deep Dive

A growing threat to deployed LLMs is malicious intent hidden across multiple benign-looking dialogue turns, which bypasses single-prompt guardrails. Even advanced commercial models with safety alignment remain vulnerable. Researchers from Purdue, Georgia Tech, UChicago, and IBM Research tackled this by proposing TurnGate, a turn-level monitor that identifies the earliest turn at which delivering a response would make the accumulated interaction sufficient for harmful action. This precise intervention avoids prematurely refusing benign exploratory conversations.

To support training and evaluation, the team built the Multi-Turn Intent Dataset (MTID), featuring branching attack rollouts, matched benign hard negatives, and annotations of the earliest harm-enabling turns. TurnGate significantly outperforms existing baselines in harmful-intent detection while keeping over-refusal rates low. It also generalizes across different domains, attacker pipelines, and target models. The code and dataset are publicly available, offering a practical defense against sophisticated multi-turn jailbreaks.
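The core idea, scoring the accumulated conversation at each turn rather than each prompt in isolation, can be sketched as below. This is a minimal illustration, not the paper's implementation: the function names, the scoring interface, and the threshold are all assumptions, and a real monitor would use a trained classifier rather than a plug-in scoring function.

```python
# Hypothetical sketch of turn-level gating. All names and the scoring
# interface are illustrative assumptions, not TurnGate's actual API.
from typing import Callable, List, Optional


def earliest_harm_enabling_turn(
    turns: List[str],
    score_fn: Callable[[str], float],
    threshold: float = 0.5,
) -> Optional[int]:
    """Return the 0-based index of the earliest turn at which the
    accumulated dialogue crosses the harm threshold, or None if no
    turn does (i.e., the conversation stays benign)."""
    context = ""
    for i, turn in enumerate(turns):
        context += ("\n" if context else "") + turn
        # Score the *accumulated* interaction, not the single turn:
        # individually benign prompts may compose into harmful intent.
        if score_fn(context) >= threshold:
            return i
    return None
```

Intervening at the returned turn, rather than refusing at the first vaguely suspicious prompt, is what keeps over-refusal low on benign exploratory conversations.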

Key Points
  • TurnGate detects the earliest turn where a response would enable malicious action, preventing both harm and premature refusal.
  • The MTID dataset includes branching attack rollouts and benign hard negatives to train precise turn-level interventions.
  • TurnGate outperforms existing baselines in detection accuracy while maintaining low over-refusal rates, and generalizes across domains and models.

Why It Matters

Protects LLMs from sophisticated multi-turn jailbreaks, reducing safety risks while maintaining conversational utility.