Research & Papers

SaFeR-Steer: Evolving Multi-Turn MLLMs via Synthetic Bootstrapping and Feedback Dynamics

New framework improves Qwen2.5-VL safety scores from 12% to 55% against multi-turn attacks.

Deep Dive

A research team led by Haolong Hu has introduced SaFeR-Steer, a framework designed to address a critical vulnerability of multimodal large language models (MLLMs) in multi-turn conversations. Current safety alignment methods rely primarily on single-turn data, a dangerous mismatch, because real attackers escalate harmful intent gradually across multiple interactions. SaFeR-Steer bridges this gap by combining staged synthetic bootstrapping with tutor-in-the-loop GRPO (Group Relative Policy Optimization), training a single student model under adaptive, on-policy attacks that evolve as the conversation unfolds.
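
To make that training loop concrete, here is a minimal, self-contained sketch of tutor-in-the-loop GRPO. Everything in it is illustrative: the Tutor and Student stubs, the toy safety reward, and the method names are assumptions standing in for the paper's actual models and interfaces; only the group-relative advantage computation reflects GRPO's defining step.

    import random
    import statistics

    class Tutor:
        # Stand-in for the attacker/tutor model; in SaFeR-Steer the tutor
        # adapts its next attack to the student's current behavior (on-policy).
        def next_attack(self, history):
            return f"attack escalated to step {len(history)}"

    class Student:
        # Stand-in for the MLLM being trained.
        def respond(self, history):
            return random.choice(["refusal", "compliance"])

    def rollout(tutor, student, turns=4):
        # One multi-turn dialogue: attack turn, then student reply, repeated.
        history = []
        for _ in range(turns):
            history.append(("user", tutor.next_attack(history)))
            history.append(("assistant", student.respond(history)))
        return history

    def toy_safety_reward(history):
        # Toy per-dialogue reward (+1 per refusal, -1 otherwise); the paper
        # instead uses TCSR, sketched after the next paragraph.
        return sum(1 if text == "refusal" else -1
                   for role, text in history if role == "assistant")

    def grpo_advantages(rewards):
        # GRPO's defining step: score each rollout against its own group's
        # mean and standard deviation, so no learned value critic is needed.
        mean = statistics.mean(rewards)
        std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
        return [(r - mean) / std for r in rewards]

    tutor, student = Tutor(), Student()
    group = [rollout(tutor, student) for _ in range(4)]
    advantages = grpo_advantages([toy_safety_reward(d) for d in group])
    # `advantages` would weight the policy-gradient update of the student.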

The framework's key innovation is TCSR (Trajectory Cumulative Safety Reward), which propagates late-turn safety failures back to earlier turns in the dialogue, enabling more effective learning from multi-turn attack patterns. Applied to Qwen2.5-VL models, the results were dramatic: the 3B-parameter version's multi-turn safety score jumped from 12.55% to 55.58%, while the 7B model improved from 24.66% to 64.89%. These safety gains came without sacrificing helpfulness, which was maintained or improved.
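
The paper's exact TCSR formulation is not reproduced here, but the core idea, propagating late-turn failures backwards, can be sketched as a discounted backward pass over per-turn safety scores. The ±1 scoring scale and the discount factor gamma below are assumptions for illustration.

    def tcsr(per_turn_safety, gamma=0.9):
        # Sketch of a Trajectory Cumulative Safety Reward: walk the dialogue
        # backwards so a failure at turn T also lowers the reward credited to
        # turns 1..T-1, the turns that set up the unsafe continuation.
        rewards, running = [0.0] * len(per_turn_safety), 0.0
        for t in reversed(range(len(per_turn_safety))):
            running = per_turn_safety[t] + gamma * running
            rewards[t] = running
        return rewards

    # A 4-turn dialogue that only goes wrong at the final turn:
    print(tcsr([1.0, 1.0, 1.0, -1.0]))  # approx. [1.981, 1.09, 0.1, -1.0]
    # Versus an all-safe trajectory:
    print(tcsr([1.0, 1.0, 1.0, 1.0]))   # approx. [3.439, 2.71, 1.9, 1.0]
    # The late failure drags down the reward of every earlier turn as well.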

To support this research, the team released the comprehensive STEER dataset, containing 18,161 multi-turn dialogues across three subsets: STEER-SFT (12,934 dialogues) for supervised fine-tuning, STEER-RL (2,000) for reinforcement learning, and STEER-Bench (3,227) for evaluation. The dataset spans conversations of 2-10 turns, specifically designed to train and benchmark models against the type of progressive safety attacks that current single-turn alignment misses completely.
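
For a sense of what such a dialogue might look like in practice, here is a purely hypothetical record layout; the field names are illustrative guesses, not the released STEER schema.

    # Hypothetical STEER record; field names are assumptions, not the
    # dataset's actual schema.
    example = {
        "subset": "STEER-SFT",  # or "STEER-RL" / "STEER-Bench"
        "turns": [              # each dialogue spans 2-10 turns
            {"role": "user", "content": "innocuous opening question", "image": "optional"},
            {"role": "assistant", "content": "helpful reply"},
            {"role": "user", "content": "follow-up that escalates the harmful intent"},
            {"role": "assistant", "content": "refusal with a safe alternative"},
        ],
    }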

Key Points
  • Improved Qwen2.5-VL-7B multi-turn safety from 24.66% to 64.89% using synthetic attack training
  • Released STEER dataset with 18,161 multi-turn dialogues spanning 2-10 turns for safety benchmarking
  • Introduced TCSR reward mechanism that propagates late-turn safety failures to earlier conversation turns

Why It Matters

Enables safer deployment of multimodal AI assistants in real conversations where attackers use progressive manipulation tactics.