DDOR framework automates overrefusal testing and repair in LLMs

arXiv cs.SE June 03, 2026

⚡ Delta debugging pinpoints the exact phrases that cause safe queries to be wrongly rejected

Deep Dive

Overrefusal is a growing problem for LLMs: safety alignment can cause models to reject perfectly benign queries that merely look risky (for example, “how to kill time”, which superficially resembles “how to kill a person”). A new arXiv preprint introduces DDOR, a fully automated framework that works in a black-box setting, with no access to the model's internal safety mechanisms. DDOR applies delta debugging to iteratively shrink a refused prompt down to the smallest phrase that still triggers the refusal, producing minimal refusal-triggering fragments (mRTFs) along with clear, human-readable explanations.
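To make the shrinking step concrete, here is a minimal sketch of the classic ddmin delta-debugging loop applied to a prompt. It is not DDOR's actual implementation: splitting on whitespace stands in for phrase-level segmentation, and `refuses` is a hypothetical black-box oracle that would, in practice, query the model and detect a refusal.

```python
from typing import Callable, List


def ddmin(phrases: List[str], refuses: Callable[[str], bool]) -> List[str]:
    """Shrink a refused prompt to a 1-minimal list of phrases that is still refused.

    `refuses` is a hypothetical black-box oracle (an assumption of this sketch):
    it returns True when the target model refuses the given text.
    """
    if not refuses(" ".join(phrases)):
        raise ValueError("the full prompt must be refused to start with")

    n = 2  # current granularity: number of chunks to split into
    while len(phrases) >= 2:
        chunk = max(1, len(phrases) // n)
        subsets = [phrases[i:i + chunk] for i in range(0, len(phrases), chunk)]
        reduced = False
        for i, subset in enumerate(subsets):
            # Everything except the i-th chunk, in original order.
            complement = [p for j, s in enumerate(subsets) if j != i for p in s]
            if refuses(" ".join(subset)):
                # A single chunk is enough to trigger the refusal.
                phrases, n, reduced = subset, 2, True
                break
            if complement and refuses(" ".join(complement)):
                # The i-th chunk is irrelevant, so drop it.
                phrases, n, reduced = complement, max(n - 1, 2), True
                break
        if not reduced:
            if n >= len(phrases):
                break  # already at single-phrase granularity: 1-minimal
            n = min(len(phrases), n * 2)  # refine the split and retry
    return phrases


def toy_refuses(text: str) -> bool:
    """Toy stand-in for a real model call: refuses anything containing 'kill'."""
    return "kill" in text.lower().split()


fragment = ddmin("how to kill time before my flight".split(), toy_refuses)
print(fragment)
```

With the toy oracle, the loop narrows the prompt down to the single word that triggers the refusal. A real oracle is slower and possibly non-deterministic, so a practical version would likely cache responses and repeat queries.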

DDOR then uses those mRTFs to generate diverse, context-rich prompts and runs multi-oracle validation to filter out truly harmful or ambiguous cases. The result is a scalable, model-specific test suite of roughly 1,000 cases per model. Beyond testing, DDOR uses the localized fragments for targeted prompt repair, rewriting only the problematic parts while preserving the original intent. According to the preprint, experiments show a substantial reduction in overrefusal without degrading safety on genuinely dangerous inputs. The approach offers a practical end-to-end solution for both evaluating and fixing overrefusal, improving LLM usability without compromising guardrails.
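The validation and repair steps can be illustrated with a short sketch. Everything here is an assumption made for illustration rather than the preprint's method: the oracles are plain callables, a prompt is kept only if no oracle flags it (one possible voting rule), and the repair candidates come from a hand-written list instead of a model-generated rewrite.

```python
from typing import Callable, Iterable, Optional

Oracle = Callable[[str], bool]


def is_benign(prompt: str, harm_oracles: Iterable[Oracle]) -> bool:
    """Keep a prompt only if no oracle flags it as harmful.

    Requiring a unanimous verdict is an assumed voting rule for this sketch.
    """
    return not any(flags(prompt) for flags in harm_oracles)


def repair(prompt: str, fragment: str, rewrites: Iterable[str],
           refuses: Oracle) -> Optional[str]:
    """Rewrite only the localized fragment until the prompt is no longer refused.

    `rewrites` is a hypothetical list of intent-preserving alternatives;
    returns None when no candidate gets past the refusal oracle.
    """
    for alternative in rewrites:
        candidate = prompt.replace(fragment, alternative)
        if candidate != prompt and not refuses(candidate):
            return candidate
    return None


def toy_refuses(text: str) -> bool:
    """Toy stand-in for a real model call: refuses anything containing 'kill'."""
    return "kill" in text.lower().split()


# Toy harm oracles (hypothetical keyword checks, not real classifiers).
toy_oracles = [lambda p: "person" in p.split(), lambda p: "weapon" in p.split()]

prompt = "how to kill time at the airport"
if is_benign(prompt, toy_oracles) and toy_refuses(prompt):
    print(repair(prompt, "kill", ["pass", "spend"], toy_refuses))
```

Restricting the rewrite to the fragment leaves the rest of the prompt untouched, which is the point of localizing the trigger before repairing it.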

Key Points

DDOR uses delta debugging to identify minimal refusal-triggering fragments (mRTFs) at the phrase level
Automatically generates ~1,000 diverse, model-specific test cases per model for overrefusal testing
Targeted prompt repair reduces overrefusal while maintaining safety on truly harmful inputs

Why It Matters

Addresses a key usability pain point in LLMs, unwarranted refusal of benign requests, without sacrificing safety alignment.
