Chainwash: Multi-Step Rewriting Attacks on Diffusion Language Model Watermarks
Simple iterative rewrites by small open models can erase nearly all AI text watermarks.
A new paper from Ameen et al. demonstrates that statistical watermarks intended to verify AI-generated text can be defeated by chaining multiple rewrites, a technique they call Chainwash. The study focuses on diffusion language models (such as LLaDA 8B Instruct), which generate tokens in arbitrary order, making traditional autoregressive watermarking schemes inapplicable. The existing watermark by Gloaguen et al. achieves >99% true-positive detection on original, unmodified output. The researchers tested what happens when that watermarked text is rewritten not once but several times.
They generated 1,605 watermarked completions (~300 tokens each) across five WaterBench domains, then used four small open-weight language models (1.5B to 8B parameters), none with any knowledge of the watermark key, to rewrite the texts in five styles: paraphrase, humanize, simplify, academic, and summarize+expand. Each chain ran for up to five hops, producing 160,500 rewritten texts in total. At the standard significance threshold, 87.9% of the original watermarked outputs were detected. After a single rewrite, detection fell to between 14% and 41% depending on the model and style. After five chained rewrites, detection collapsed to just 4.86%, meaning roughly 94.5% of originally flagged texts escaped detection. After three rewrites, the detector score had already dropped 86% of the way toward the null distribution. The attack's simplicity and low compute cost make Chainwash a serious threat to current watermarking methods for diffusion LMs.
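The chaining procedure itself is straightforward. A minimal sketch of the attack loop, with a placeholder `rewrite` function standing in for the actual LLM call and hypothetical model names (the real models and prompts are not specified here):

```python
from itertools import product

# Five rewrite styles from the paper; model names are illustrative placeholders.
STYLES = ["paraphrase", "humanize", "simplify", "academic", "summarize+expand"]
MODELS = ["model-1.5B", "model-3B", "model-7B", "model-8B"]
MAX_HOPS = 5

def rewrite(text: str, style: str, model: str) -> str:
    """Placeholder for an LLM call: prompt `model` to rewrite `text` in `style`."""
    return f"{text} [{model}:{style}]"

def chainwash(text: str, style: str, model: str, hops: int = MAX_HOPS) -> list[str]:
    """Feed each hop's output back in as the next hop's input; keep every hop."""
    outputs = []
    for _ in range(hops):
        text = rewrite(text, style, model)
        outputs.append(text)
    return outputs

# 1,605 completions x 4 models x 5 styles x 5 hops = 160,500 rewritten texts.
completions = [f"watermarked completion {i}" for i in range(1605)]
total = sum(
    len(chainwash(c, style, model))
    for c, (model, style) in product(completions, product(MODELS, STYLES))
)
print(total)  # 160500
```

Each hop consumes the previous hop's output, so the watermark signal is diluted cumulatively rather than attacked once.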
- Detection rate falls from 87.9% on original to 14–41% after just one rewrite, and to 4.86% after five chained rewrites.
- Attack tested across 4 open-weight models (1.5B–8B params), 5 rewrite styles, and 160,500 texts — consistent results across all.
- After three rewrites, the watermark detector score drops 86% toward the null distribution, indicating rapid erosion of detectability.
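The "86% toward the null" figure in the last bullet can be read as a normalized gap between the watermarked baseline score and the unwatermarked (null) score. A minimal sketch, using illustrative z-score values rather than the paper's actual detector statistics:

```python
def fraction_toward_null(score_original: float, score_null: float,
                         score_observed: float) -> float:
    """How far the observed detector score has moved from the watermarked
    baseline toward the null (unwatermarked) distribution, in [0, 1]."""
    return (score_original - score_observed) / (score_original - score_null)

# Illustrative values, not from the paper: watermarked z ~ 6.0, null z ~ 0.0.
z_orig, z_null = 6.0, 0.0
z_after_3_hops = z_orig - 0.86 * (z_orig - z_null)  # the reported 86% erosion
print(round(fraction_toward_null(z_orig, z_null, z_after_3_hops), 2))  # 0.86
```

Once the score sits this close to the null distribution, it falls below any reasonable significance threshold, which is why detection collapses after a few hops.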
Why It Matters
For any system that trusts LLM watermarks for content provenance, Chainwash shows those watermarks can be rendered useless by simple, automated rewriting.