Audio & Speech

XAttnMark audio watermarking achieves SOTA detection and attribution with cross-attention

New cross-attention watermark stays detectable after generative audio editing at varying strengths.

Deep Dive

The rapid rise of generative audio synthesis and editing has created an urgent need for robust watermarking to protect copyright and combat deepfake misinformation. Existing neural methods such as WavMark and AudioSeal struggle to jointly optimize detection and attribution. XAttnMark bridges this gap with a novel architecture that pairs a generator and a detector through cross-attention and partial parameter sharing. It also introduces a temporal conditioning module to improve message distribution, along with a psychoacoustic-aligned time-frequency masking loss that models fine-grained auditory masking for better imperceptibility.
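
To make the cross-attention idea concrete, here is a minimal sketch of generic scaled dot-product cross-attention in plain Python. It is not XAttnMark's actual layer: the names `bit_embeddings` and `detector_features` are hypothetical, and treating a small embedding table as the part shared between generator and detector is only one assumed reading of "partial parameter sharing".

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Generic scaled dot-product cross-attention.

    queries: list of d-dim vectors (here: hypothetical detector features)
    keys, values: lists of vectors (here: a hypothetical shared embedding table)
    Returns one attended output vector per query.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs


# Assumption: two embeddings standing in for bit values 0 and 1 (illustrative only).
bit_embeddings = [[1.0, 0.0], [0.0, 1.0]]
# Assumption: toy detector features, one per message position.
detector_features = [[4.0, 0.0], [0.0, 4.0]]

attended = cross_attention(detector_features, bit_embeddings, bit_embeddings)
# Read each bit as the index of the embedding that received the most attention.
decoded_bits = [row.index(max(row)) for row in attended]
```

In this toy setup the detector-side queries attend over the same table that would be used on the generator side to embed the message, which illustrates how shared parameters could let the detector look up what the generator wrote.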

In extensive tests, XAttnMark achieves state-of-the-art performance across both detection and attribution tasks, maintaining robustness against a wide range of audio transformations including challenging generative editing at varying strengths. The work, accepted at ICML 2025, provides a practical solution for verifying audio provenance and intellectual property in the generative AI era.
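
The summary does not define the two tasks. As the terms are commonly used in audio watermarking, detection asks whether a watermark is present at all, while attribution asks which embedded message, and therefore which owner, it carries. The sketch below illustrates attribution as a nearest-message lookup by Hamming distance. This matching rule, the `registry` contents, and the 8-bit message length are assumptions for illustration, not details confirmed by the source.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length bit lists differ."""
    return sum(x != y for x, y in zip(a, b))


def attribute(decoded_bits, registry):
    """Return (owner, distance) for the registered message closest to decoded_bits."""
    return min(
        ((owner, hamming_distance(decoded_bits, msg)) for owner, msg in registry.items()),
        key=lambda pair: pair[1],
    )


# Hypothetical registry mapping owners to the messages embedded in their audio.
registry = {
    "user_a": [0, 1, 1, 0, 1, 0, 0, 1],
    "user_b": [1, 1, 0, 0, 0, 1, 1, 0],
}

# Decoded message after an audio transformation flipped one bit of user_a's message.
decoded = [0, 1, 1, 0, 1, 0, 1, 1]
owner, distance = attribute(decoded, registry)
```

Nearest-message matching tolerates a few bit errors introduced by editing, which is why robustness of the decoded message matters for attribution and not only for detection.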

Key Points
  • Cross-attention mechanism between generator and detector for efficient message retrieval and joint optimization.
  • Psychoacoustic-aligned time-frequency masking loss improves imperceptibility by modeling human auditory masking effects.
  • Superior robustness against generative audio editing (e.g., style transfer, re-synthesis) at varying strengths.

Why It Matters

Helps protect audio IP and verify content provenance as deepfake and generative editing threats grow.