Audio & Speech

APC audio watermark beats deepfakes with 98% verification rate

Training-free cryptographic signing layer withstands cropping, re-encoding, and low-pass attacks.

Deep Dive

A team led by Guang Yang has introduced Asymmetric Phase Coding (APC), a training-free audio watermarking technique designed as a cryptographic provenance primitive against deepfake voice attacks. APC embeds a 64-byte Ed25519 digital signature (FIPS 186-5) into the spectrogram by pseudo-randomly selecting STFT phase bins and applying a redundant quantization-index-modulation (QIM) code to the log-magnitude differences of adjacent bin pairs. Reed-Solomon error correction lets the signature be recovered even after heavy signal degradation. The watermark is blind-extractable, meaning anyone with the public key can verify authenticity without the original audio, and non-repudiable, meaning a valid signature can only have come from the holder of the private key.
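The embedding step described above can be illustrated with a small sketch. This is not the authors' implementation: it follows only the log-magnitude-difference part of the description, works on a plain list of magnitudes for a single frame rather than a real STFT, and leaves out the Ed25519 signature and Reed-Solomon coding entirely. Simple repetition with a majority vote stands in for the "redundant" code, and every name and parameter here (STEP, select_pairs, repeat, the demo key) is a hypothetical chosen for illustration.

```python
import hashlib
import math
import random

STEP = 0.5  # hypothetical QIM step in log-magnitude units (assumption)


def select_pairs(key: bytes, n_bins: int, n_pairs: int) -> list[int]:
    """Key-seeded choice of lower indices k; each pair is (k, k + 1)."""
    seed = int.from_bytes(hashlib.sha256(key).digest(), "big")
    rng = random.Random(seed)
    # Even indices only, so that no two selected pairs share a bin.
    candidates = list(range(0, n_bins - 1, 2))
    return rng.sample(candidates, n_pairs)


def qim_embed(d: float, bit: int) -> float:
    """Snap d to the nearest point of the lattice assigned to `bit`."""
    offset = bit * STEP / 2
    return STEP * round((d - offset) / STEP) + offset


def qim_extract(d: float) -> int:
    """Return the bit whose lattice lies closest to d."""
    err0 = abs(d - qim_embed(d, 0))
    err1 = abs(d - qim_embed(d, 1))
    return 0 if err0 <= err1 else 1


def embed_bits(mags: list[float], bits: list[int], key: bytes,
               repeat: int = 3) -> list[float]:
    """Write each bit into `repeat` adjacent-bin pairs of one frame."""
    out = list(mags)
    pairs = select_pairs(key, len(out), len(bits) * repeat)
    for i, k in enumerate(pairs):
        d = math.log(out[k]) - math.log(out[k + 1])
        # Split the required change evenly between the two bins.
        shift = (qim_embed(d, bits[i // repeat]) - d) / 2
        out[k] *= math.exp(shift)
        out[k + 1] *= math.exp(-shift)
    return out


def extract_bits(mags: list[float], n_bits: int, key: bytes,
                 repeat: int = 3) -> list[int]:
    """Blind extraction: needs only the key, not the original magnitudes."""
    pairs = select_pairs(key, len(mags), n_bits * repeat)
    bits = []
    for j in range(n_bits):
        group = pairs[j * repeat:(j + 1) * repeat]
        votes = sum(
            qim_extract(math.log(mags[k]) - math.log(mags[k + 1]))
            for k in group
        )
        bits.append(1 if 2 * votes > repeat else 0)
    return bits
```

Because each bit is carried by a difference of log-magnitudes, multiplying the whole frame by a constant gain leaves the embedded values unchanged, which is one plausible reason to prefer differences over absolute magnitudes. In this sketch a pair decodes correctly as long as its log-magnitude difference moves by less than STEP / 4.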

On 1,000 LibriSpeech test-clean clips (10s each, 44.1 kHz), APC was evaluated against eight attack conditions, including 20% end-cropping, 8 kHz low-pass filtering, FLAC re-encoding, MP3 at 128 kbps, and OGG-Vorbis at 128 kbps. It achieved cryptographic verification rates between 97.5% and 98.3% across all attacks, with a mean PESQ audio quality score of 3.02 and CPU latency in the tens of milliseconds. The authors compared APC against neural baselines (AudioSeal, WavMark, SilentCipher) and also quantified an adaptive white-box erasure attack. Code, keys, and metadata are released for reproducibility.
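To put the reported range in context, the sketch below computes a 95% Wilson score interval for a verification rate measured on 1,000 clips. It assumes each attack condition was scored on all 1,000 clips, which the summary implies but does not state outright; the function name and the choice of the Wilson interval are illustrative and not taken from the paper.

```python
import math


def wilson_interval(successes: int, trials: int,
                    z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(
        p * (1 - p) / trials + z * z / (4 * trials * trials)
    ) / denom
    return center - half, center + half


# 97.5% and 98.3% of 1,000 clips correspond to 975 and 983 verified clips.
low_end = wilson_interval(975, 1000)
high_end = wilson_interval(983, 1000)
print(f"97.5%: [{low_end[0]:.3f}, {low_end[1]:.3f}]")
print(f"98.3%: [{high_end[0]:.3f}, {high_end[1]:.3f}]")
```

Under that assumption the two intervals overlap, so at this sample size the gap between the best and worst attack condition may sit within sampling noise.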

Key Points
  • APC uses Ed25519 digital signatures (64 bytes) combined with Reed-Solomon error correction and STFT phase-bin selection for a compact, blind-extractable watermark.
  • Achieves 97.5%–98.3% cryptographic verification across eight attack conditions (cropping, re-encoding, filtering), with a mean PESQ of 3.02 and CPU latency in the tens of milliseconds.
  • Training-free and non-repudiable; explicitly compared against AudioSeal, WavMark, and SilentCipher with open-source code and keys.

Why It Matters

Enables auditable, cryptographically verifiable audio provenance without any model training, which matters for maintaining trust in voice-based systems exposed to deepfakes.