Research & Papers

Useless but Safe? Benchmarking Utility Recovery with User Intent Clarification in Multi-Turn Conversations

LLMs initially fulfill only 10.5–37.6% of benign intent; a new benchmark uncovers multi-turn recovery failure modes.

Deep Dive

Current LLM safety alignment techniques focus on refusing harmful requests but often fail to recover utility when benign users clarify their intent. To address this, a team from Carnegie Mellon University and the University of Southern California (Mingqian Zheng, Malia Morgan, Liwei Jiang, Carolyn Rosé, and Maarten Sap) introduced CarryOnBench, the first interactive benchmark designed to measure whether LLMs revise their interpretation of user intent and restore helpfulness over multiple conversation turns. Starting from 398 queries that appear harmful but have benign underlying intents (e.g., "How to make a poison?" for a chemistry project), they simulated 5,970 conversations by varying user follow-up sequences, evaluating 14 major models, including GPT-4, Claude, and Llama. The benchmark yielded 1,866 distinct conversation flows spanning 4–12 turns, totaling 23,880 model responses.
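The simulation loop itself is easy to picture: issue the ambiguous query, then feed scripted clarifications one turn at a time, logging every reply. Below is a minimal sketch assuming a generic chat-completion callable; the function names, the stub model, and the follow-up strings are illustrative stand-ins, not the paper's actual harness.

    def simulate_conversation(chat_model, query, clarifications):
        """Run one multi-turn conversation: the ambiguous query first, then
        each clarifying follow-up, recording the model's reply every turn."""
        messages = [{"role": "user", "content": query}]
        responses = []
        for turn in range(len(clarifications) + 1):
            if turn > 0:  # turns 2+: the user reveals more of the benign intent
                messages.append({"role": "user", "content": clarifications[turn - 1]})
            reply = chat_model(messages)
            messages.append({"role": "assistant", "content": reply})
            responses.append(reply)
        return responses

    # Trivial stand-in model so the sketch runs; a real harness calls an LLM API.
    def stub_model(messages):
        return "I can't help with that." if len(messages) == 1 else "Here is an overview..."

    # Hypothetical follow-up sequence built around the paper's running example.
    transcript = simulate_conversation(
        stub_model,
        query="How to make a poison?",
        clarifications=[
            "This is for a high-school chemistry project on household toxins.",
            "I only need the general mechanism, not synthesis steps.",
        ],
    )

Varying which clarifications appear, and in what order, is what turns 398 seed queries into thousands of distinct conversation flows.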

The findings reveal critical shortcomings. At turn one, models fulfill only 10.5–37.6% of the user's benign information need—a stark contrast to 25.1–72.1% when the same query includes the benign intent upfront, confirming that models withhold information due to intent misinterpretation, not limited knowledge. With multi-turn clarifications, 13 of 14 models eventually approach or exceed that single-turn baseline, but the recovery cost varies significantly. The paper identifies three failure modes invisible to single-turn evaluations: utility lock-in (the model rarely updates despite clear clarification), unsafe recovery (it updates but at disproportionate safety cost), and repetitive recovery (it recycles prior responses instead of providing new information). Moreover, conversations converge to similar harmfulness levels regardless of how conservative the model started, suggesting that safety alignment may create a uniformity trap. These results expose a fundamental gap: current safety evaluations may conflate appropriate caution with unresponsiveness, and models need better mechanisms to distinguish benign users from adversaries in extended dialogues.
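To make the taxonomy concrete, here is a sketch of how the three failure modes might be detected from per-turn utility and safety scores. The thresholds and the lexical similarity measure are placeholder assumptions for illustration; this summary does not describe the paper's actual operationalization.

    def jaccard(a, b):
        """Crude lexical overlap between two replies; a stand-in for whatever
        response-similarity measure the benchmark actually uses."""
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

    def classify_failure_mode(utility, safety, responses,
                              recover_at=0.5, max_safety_drop=0.3, repeat_at=0.9):
        """Label one conversation with a failure mode (or 'recovered').
        utility and safety are per-turn scores in [0, 1]; all thresholds
        here are invented for illustration, not taken from the paper."""
        if utility[-1] < recover_at:
            return "utility lock-in"      # never updates despite clear clarification
        if safety[0] - safety[-1] > max_safety_drop:
            return "unsafe recovery"      # utility restored at disproportionate safety cost
        if any(jaccard(responses[i], responses[i - 1]) > repeat_at
               for i in range(1, len(responses))):
            return "repetitive recovery"  # recycles prior replies instead of adding info
        return "recovered"

    # Example: a model that stays locked in and never serves the benign need.
    print(classify_failure_mode(
        utility=[0.1, 0.1, 0.2], safety=[1.0, 1.0, 1.0],
        responses=["I can't help.", "I can't help.", "I still can't help."]))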

Key Points
  • Initial utility fulfillment ranges from 10.5% to 37.6% across 14 models when queries appear harmful; with the benign intent stated upfront, it jumps to 25.1–72.1%
  • After multi-turn clarification, 13 of 14 models recover utility but vary in recovery cost and safety trade-offs
  • Three newly identified failure modes: utility lock-in (no update), unsafe recovery (disproportionate safety drop), and repetitive recovery (recycled responses)

Why It Matters

Exposes that single-turn safety evaluations cannot distinguish appropriate caution from plain unresponsiveness to clarified benign intent.