Audio & Speech

Wavelet tokenizer unifies audio, image, and video into one AI model

New research proposes a single wavelet-based tokenizer shared across audio, image, and video, in place of separate modality-specific encoders.

Deep Dive

A new preprint from Shenghao Ding proposes using wavelets as a common tokenizer for audio, images, and video. Instead of relying on separate modality-specific latent grids (such as patch embeddings for images or spectrogram patches for audio), the paper builds a shared wavelet token schema around a one-level Haar Discrete Wavelet Transform (DWT) frontend. The model combines a shared coefficient-token layout, optional structural metadata, lightweight modality value adapters, and a shared token-wise encoder-decoder trunk. This design treats all three signal types on equal footing, so a single model can process speech commands, satellite images, and video frames with minimal modality-specific changes.
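
For readers unfamiliar with the transform, here is a minimal sketch of a one-level orthonormal Haar DWT in NumPy. It is an illustration only, not the paper's code: the function names (haar_dwt_1d, haar_idwt_1d, haar_dwt_2d) are hypothetical, the 2-D version assumes a single-channel array with even height and width, and the sub-band labels follow one common convention among several.

```python
import numpy as np

SQRT2 = np.sqrt(2.0)


def haar_dwt_1d(x):
    """One-level orthonormal Haar DWT along the last axis (even length)."""
    x = np.asarray(x, dtype=np.float64)
    if x.shape[-1] % 2 != 0:
        raise ValueError("last axis must have even length")
    even, odd = x[..., 0::2], x[..., 1::2]
    approx = (even + odd) / SQRT2   # low-pass: scaled pairwise sums
    detail = (even - odd) / SQRT2   # high-pass: scaled pairwise differences
    return approx, detail


def haar_idwt_1d(approx, detail):
    """Inverse of haar_dwt_1d; reconstructs the signal exactly."""
    approx = np.asarray(approx, dtype=np.float64)
    detail = np.asarray(detail, dtype=np.float64)
    out = np.empty(approx.shape[:-1] + (2 * approx.shape[-1],))
    out[..., 0::2] = (approx + detail) / SQRT2
    out[..., 1::2] = (approx - detail) / SQRT2
    return out


def haar_dwt_2d(img):
    """One-level 2-D Haar DWT of an (H, W) array, as four sub-bands.

    Filters along the width first, then along the height.
    Labels (assumed convention): ll = low/low, lh = low along width and
    high along height, hl = the reverse, hh = high/high.
    """
    lo, hi = haar_dwt_1d(img)  # filter along width
    ll, lh = (b.swapaxes(-1, -2) for b in haar_dwt_1d(lo.swapaxes(-1, -2)))
    hl, hh = (b.swapaxes(-1, -2) for b in haar_dwt_1d(hi.swapaxes(-1, -2)))
    return ll, lh, hl, hh
```

An audio clip would pass through haar_dwt_1d, and an image or a single video frame through haar_dwt_2d. In both cases the output is a set of coefficient arrays that could then be flattened into tokens. How the paper actually lays those tokens out is not described here.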

Results on Speech Commands, EuroSAT RGB, and DAVIS 2017 show the dense shared model achieving 39.92 dB PSNR for audio reconstruction, 29.37 dB for images, and 23.93 dB for video. A key finding is that fixed-rate energy selection (i.e., keeping only the highest-energy wavelet coefficients) provides a strong non-parametric baseline, improving average PSNR over uniform selection by 16.73 dB for audio, 16.90 dB for images, and 15.86 dB for video under compressed keep ratios. Additionally, masked sparse training reaches 34.45 dB video PSNR with only 50% of dense tokens. The paper supports a unified wavelet token schema and sparse token interface, though it stops short of establishing a universal discrete vocabulary.
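
The energy-selection baseline can be illustrated with a short sketch. It is not the paper's implementation, and it rests on several assumptions: "uniform selection" is taken to mean evenly spaced coefficients, dropped coefficients are set to zero, and PSNR is computed directly on the coefficient array (for an orthonormal transform such as Haar, the squared error in the coefficient domain equals the squared error in the signal domain). The helper names psnr, keep_top_energy, and keep_uniform are hypothetical.

```python
import numpy as np


def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(peak^2 / MSE)."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((reference - estimate) ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(peak ** 2 / mse))


def _num_kept(size, keep_ratio):
    # At least one coefficient is always kept.
    return max(1, min(size, int(round(keep_ratio * size))))


def keep_top_energy(coeffs, keep_ratio):
    """Zero all but the highest-energy (largest squared) coefficients."""
    coeffs = np.asarray(coeffs, dtype=np.float64)
    flat = coeffs.ravel()
    k = _num_kept(flat.size, keep_ratio)
    keep = np.argsort(flat ** 2)[-k:]      # indices of the k largest energies
    out = np.zeros_like(flat)
    out[keep] = flat[keep]
    return out.reshape(coeffs.shape)


def keep_uniform(coeffs, keep_ratio):
    """Zero all but k evenly spaced coefficients (assumed meaning of 'uniform')."""
    coeffs = np.asarray(coeffs, dtype=np.float64)
    flat = coeffs.ravel()
    k = _num_kept(flat.size, keep_ratio)
    keep = np.floor(np.arange(k) * flat.size / k).astype(int)
    out = np.zeros_like(flat)
    out[keep] = flat[keep]
    return out.reshape(coeffs.shape)
```

Keeping the k largest-magnitude coefficients minimizes squared error among all ways of keeping k coefficients, so on any input keep_top_energy scores at least as high a PSNR as keep_uniform at the same keep ratio. The size of the gap depends on how concentrated the signal's energy is, which is what the reported dB improvements measure on real data.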

Key Points
  • Unified Haar DWT tokenizer achieves 39.92 dB (audio), 29.37 dB (image), and 23.93 dB (video) PSNR on standard benchmarks.
  • Fixed-rate energy selection improves average PSNR over uniform token selection by 16.73 dB (audio), 16.90 dB (image), and 15.86 dB (video) under compressed keep ratios.
  • Masked sparse training reaches 34.45 dB video PSNR using only 50% of tokens, showing viability for efficient processing.

Why It Matters

A single architecture for audio, image, and video could drastically simplify multimodal AI systems and reduce training costs.