Developer Tools

DDOR framework automates overrefusal testing and repair in LLMs

Delta debugging pinpoints exact phrases causing safe queries to be wrongly rejected

Deep Dive

Overrefusal is a growing problem for LLMs: safety alignment causes models to reject benign queries that merely look risky (e.g., refusing “how to kill time” because it superficially resembles “how to kill a person”). A new preprint introduces DDOR, a fully automated framework that works in a black-box setting, with no access to the model's internal safety mechanisms required. DDOR applies delta debugging to iteratively shrink a refused prompt down to the smallest phrase that still triggers the refusal, producing minimal refusal-triggering fragments (mRTFs) with clear, human-readable explanations.
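To make the shrinking step concrete, here is a minimal sketch of classic delta debugging (ddmin) applied to a prompt's words. It is not the preprint's implementation: the word-level tokenization, the `ddmin` function, and the keyword-based `toy_is_refused` oracle are all assumptions for illustration. In practice the oracle would send the candidate prompt to the model under test and check whether the reply is a refusal.

```python
from typing import Callable, List


def ddmin(tokens: List[str], is_refused: Callable[[str], bool]) -> List[str]:
    """Shrink `tokens` to a 1-minimal subsequence that is still refused.

    `is_refused` is a black-box oracle: it sees only the prompt text.
    """
    if not is_refused(" ".join(tokens)):
        raise ValueError("the full prompt must be refused to begin with")

    n = 2  # granularity: how many chunks the prompt is split into
    while len(tokens) >= 2:
        size = max(1, len(tokens) // n)
        chunks = [tokens[i:i + size] for i in range(0, len(tokens), size)]
        reduced = False

        # 1) Does a single chunk trigger the refusal on its own?
        for chunk in chunks:
            if is_refused(" ".join(chunk)):
                tokens, n, reduced = chunk, 2, True
                break

        # 2) Otherwise, does the prompt stay refused with one chunk removed?
        if not reduced:
            for i in range(len(chunks)):
                rest = [t for j, c in enumerate(chunks) if j != i for t in c]
                if is_refused(" ".join(rest)):
                    tokens, n, reduced = rest, max(n - 1, 2), True
                    break

        # 3) No progress: split finer, or stop at single-word granularity.
        if not reduced:
            if n >= len(tokens):
                break
            n = min(len(tokens), n * 2)
    return tokens


def toy_is_refused(prompt: str) -> bool:
    """Hypothetical stand-in for a model call: refuses anything with 'kill'."""
    return "kill" in prompt.lower().split()


fragment = ddmin("how do I kill time at the airport".split(), toy_is_refused)
print(fragment)  # ['kill']
```

The result is 1-minimal: removing any single remaining word makes the refusal disappear. Each oracle call is a model query in a real setting, so the number of calls, not the string handling, dominates the cost.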

DDOR then uses those mRTFs to generate diverse, context-rich prompts and runs multi-oracle validation to filter out truly harmful or ambiguous cases. The result is a scalable, model-specific test suite of roughly 1,000 cases per model. Beyond testing, DDOR uses the localized fragments for targeted prompt repair, rewriting only the problematic parts while preserving the original intent. The reported experiments show substantial overrefusal reduction without degrading safety on genuinely dangerous inputs. The approach offers a practical end-to-end solution for both evaluating and fixing overrefusal, improving LLM usability without compromising guardrails.
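The validation step can be pictured as a vote among independent judges. The sketch below is an assumption-laden illustration, not the preprint's procedure: the summary does not say how oracle verdicts are combined, so unanimous agreement is assumed here, and `validate`, `keyword_oracle`, and `context_oracle` are hypothetical names with toy rules standing in for real judges.

```python
from typing import Callable, Iterable, List

# An oracle labels a prompt "benign", "harmful", or "ambiguous".
Oracle = Callable[[str], str]


def validate(candidates: Iterable[str], oracles: List[Oracle]) -> List[str]:
    """Keep only prompts that every oracle labels benign (assumed rule)."""
    return [p for p in candidates if all(o(p) == "benign" for o in oracles)]


def keyword_oracle(prompt: str) -> str:
    """Toy judge: flags prompts that name a human target."""
    return "harmful" if "person" in prompt.lower().split() else "benign"


def context_oracle(prompt: str) -> str:
    """Toy judge: treats very short prompts as lacking context."""
    return "ambiguous" if len(prompt.split()) < 4 else "benign"


candidates = [
    "how to kill time at the airport",
    "how to kill a person",
    "kill it",
]
suite = validate(candidates, [keyword_oracle, context_oracle])
print(suite)  # ['how to kill time at the airport']
```

Requiring unanimity is the conservative choice for an overrefusal test suite: a prompt only counts as "wrongly refused" if no judge considers it harmful or unclear.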

Key Points
  • DDOR uses delta debugging to identify minimal refusal-triggering fragments (mRTFs) at the phrase level
  • Automatically generates ~1,000 diverse test cases per model for overrefusal testing
  • Targeted prompt repair reduces overrefusal while maintaining safety on truly harmful inputs

Why It Matters

Fixes a key usability pain point in LLMs—unwarranted refusal—without sacrificing safety alignment.