Re-Mask and Redirect: Exploiting Denoising Irreversibility in Diffusion Language Models
A simple two-step trick bypasses safety in models like LLaDA-8B and Dream-7B without gradients or search.
A new research paper by Arth Singh, titled 'Re-Mask and Redirect: Exploiting Denoising Irreversibility in Diffusion Language Models,' exposes a fundamental and surprisingly simple flaw in the safety alignment of diffusion-based LLMs (dLLMs). These models, such as LLaDA-8B-Instruct and Dream-7B-Instruct, generate text by iteratively denoising masked tokens. Their safety rests on a single implicit assumption: once tokens are committed, such as a refusal to a harmful request emitted within the first 8-16 steps of a 64-step denoising process, they are never re-evaluated. Singh's attack violates this assumption by re-masking those early refusal tokens and injecting a short, 12-token affirmative prefix, redirecting the rest of the generation.
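To make the mechanism concrete, here is a minimal Python sketch of the intervention. It assumes a hypothetical dLLM interface with a per-step `denoise_step` method, a model-specific mask-token id, and an illustrative affirmative prefix string; none of these names are taken from the paper's code, and the intervention step is only a plausible choice given the 8-16-step commitment window described above.

```python
import torch

# Minimal sketch of the re-mask-and-redirect intervention. The model interface
# (denoise_step), the mask-token id, the intervention step, and the prefix text
# are assumptions for illustration, not the paper's implementation.
MASK_ID = 0           # placeholder mask-token id; model-specific in practice
TOTAL_STEPS = 64      # the standard denoising schedule described in the paper
INTERVENE_AT = 12     # after the refusal is typically committed (first 8-16 steps)

def remask_and_redirect(model, tokenizer, prompt_ids, gen_len=128):
    """Run the usual denoising loop, then re-mask the committed refusal tokens
    and overwrite the start of the response with a short affirmative prefix."""
    device = prompt_ids.device
    # The response region starts fully masked, as in standard dLLM generation.
    x = torch.cat(
        [prompt_ids, torch.full((1, gen_len), MASK_ID, device=device)], dim=1)
    resp = slice(prompt_ids.shape[1], prompt_ids.shape[1] + gen_len)

    for step in range(TOTAL_STEPS):
        x = model.denoise_step(x, step, TOTAL_STEPS)  # assumed per-step API

        if step == INTERVENE_AT:
            # Re-mask everything committed so far in the response region...
            x[:, resp] = MASK_ID
            # ...and inject a short affirmative prefix (~12 tokens in the paper;
            # this particular string is illustrative).
            prefix = tokenizer("Sure, here is a detailed, step-by-step answer:",
                               add_special_tokens=False,
                               return_tensors="pt").input_ids.to(device)
            x[:, resp.start:resp.start + prefix.shape[1]] = prefix

    return tokenizer.decode(x[0, resp], skip_special_tokens=True)
```

The remaining denoising steps then fill the re-masked region conditioned on the injected prefix, which is why no gradients or search are needed: the attacker only edits the intermediate state once and lets the schedule finish.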
This trivial intervention, requiring no gradient computation or adversarial search, achieved a 76.1% Attack Success Rate (ASR) against LLaDA-8B-Instruct and 81.8% against Dream-7B-Instruct on the HarmBench safety evaluation suite. The paper's central finding is the vulnerability's simplicity: more sophisticated gradient-based attacks actually performed worse (41.5% ASR), evidence that the weakness is structural rather than a gap in training. The safety is 'architecturally shallow,' holding only as long as the standard denoising schedule goes unchallenged.
The research concludes that current dLLM safety is not adversarially robust. It proposes potential defenses, including safety-aware unmasking schedules that don't commit too early, step-conditional prefix detection to spot injected guidance, and post-commitment re-verification mechanisms. This work forces a reevaluation of safety in an emerging class of language models, highlighting that alignment must be designed into the core generation process, not just layered on top.
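Of the proposed defenses, post-commitment re-verification is the most direct counter to the intervention sketched above. One possible form, under the same assumed model interface and with a hypothetical `safety_classifier` callable (neither drawn from the paper), is to re-check the committed response on a fixed cadence instead of trusting the first commitment:

```python
import torch

def denoise_with_reverification(model, tokenizer, safety_classifier,
                                prompt_ids, gen_len=128, total_steps=64,
                                check_every=8, mask_id=0):
    """Periodically re-verify already-committed tokens; if the committed text is
    flagged, wipe the response region and pin a refusal prefix."""
    device = prompt_ids.device
    x = torch.cat(
        [prompt_ids, torch.full((1, gen_len), mask_id, device=device)], dim=1)
    resp = slice(prompt_ids.shape[1], prompt_ids.shape[1] + gen_len)

    for step in range(total_steps):
        x = model.denoise_step(x, step, total_steps)  # assumed per-step API

        if step > 0 and step % check_every == 0:
            committed = x[0, resp]
            text = tokenizer.decode(committed[committed != mask_id],
                                    skip_special_tokens=True)
            if safety_classifier(text):  # assumed: returns True if harmful
                # Re-mask the whole response and pin a refusal prefix so that
                # later steps cannot redirect the generation again.
                x[:, resp] = mask_id
                refusal = tokenizer("I can't help with that.",
                                    add_special_tokens=False,
                                    return_tensors="pt").input_ids.to(device)
                x[:, resp.start:resp.start + refusal.shape[1]] = refusal

    return tokenizer.decode(x[0, resp], skip_special_tokens=True)
```

This is only a sketch of the idea; an actual defense would need to balance the re-verification cadence against generation cost and avoid being bypassed by interventions timed between checks.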
- Simple 'Re-Mask' attack achieves 81.8% success rate against Dream-7B-Instruct on HarmBench.
- Exploit works by re-masking early refusal tokens and injecting a 12-token affirmative prefix, requiring no gradients.
- Reveals dLLM safety is 'architecturally shallow,' relying solely on an unviolated denoising schedule.
Why It Matters
Forces a fundamental redesign of safety in diffusion LLMs, showing that current alignment can be bypassed with a trivial intervention.