PairAlign slashes audio tokens by 55% with self-alignment technique
New framework reduces audio token count by 55% while preserving search accuracy.
Researchers at IIT Kanpur have published PairAlign, a new framework for compact audio tokenization that uses self-alignment at the sequence level. Unlike conventional audio tokenizers that rely on local quantization or clustering, PairAlign treats tokenization as conditional sequence generation: an encoder maps speech to a condition, and an autoregressive decoder emits a variable-length token string from a begin token to an end token. This approach learns identity, order, length, and termination directly from data. The key innovation is a contrastive training objective: given two content-preserving views (e.g., different augmentations of the same utterance), each token string is trained to be likely under the other's representation, while unrelated examples provide competing sequences. This surrogate for edit-distance preservation prevents collapse and encourages compactness.
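The cross-view objective described above can be illustrated with a small InfoNCE-style sketch. This is not the paper's exact loss: the function names, the score-matrix layout, and the symmetric averaging are all assumptions made for illustration. The sketch takes precomputed log-likelihoods, where entry `[i][j]` is the log-probability of example `j`'s token string under the condition from the other view of example `i`, so the diagonal holds matched pairs and off-diagonal entries are the competing sequences from unrelated examples.

```python
import math


def matched_pair_nll(scores):
    """Mean negative log-softmax of the diagonal entry in each row.

    scores[i][j] is assumed to be log p(token string of example j |
    condition from the other view of example i). Row i therefore ranks the
    matched string (j == i) against competing strings (j != i).
    """
    total = 0.0
    for i, row in enumerate(scores):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in row))
        total += log_z - row[i]
    return total / len(scores)


def cross_view_loss(ll_a_under_b, ll_b_under_a):
    """Symmetric loss: view A strings under view B conditions, and vice versa."""
    return 0.5 * (matched_pair_nll(ll_a_under_b) + matched_pair_nll(ll_b_under_a))


# Hypothetical 3-example batch where matched pairs score far above competitors.
aligned = [[0.0, -10.0, -10.0],
           [-10.0, 0.0, -10.0],
           [-10.0, -10.0, 0.0]]
# Uninformative scores: every string is equally likely under every condition.
uniform = [[0.0, 0.0, 0.0] for _ in range(3)]

print(cross_view_loss(aligned, aligned))
print(cross_view_loss(uniform, uniform))
```

In this sketch the loss is near zero when each token string is far more likely under its paired view than any competitor, and it rises to the log of the batch size when the scores carry no information about which strings belong together.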
On standard 3-second speech samples from LibriSpeech, PairAlign achieves strong cross-view consistency and operates at 12.71 tokens per second on retrieval tasks. Compared to a standard VQ baseline, it reduces the total number of tokens in the archive by 55% while preserving edit-distance search performance. The authors note a compactness–locality trade-off: PairAlign does not outperform dense geometric or SSL tokenizers on every local metric, but it provides a much lower-rate symbolic interface suitable for comparison, retrieval, and analysis. The paper frames PairAlign as a sequence-symbolic analogue of JEPA-style predictive learning, one that predicts a learned variable-length symbolic sequence rather than a continuous latent. The work is under review and runs to 50 pages in total, including appendices.
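To make the retrieval setting concrete: at the reported 12.71 tokens per second, a 3-second clip comes out to roughly 38 tokens (12.71 × 3 ≈ 38.1), so comparing two clips means comparing two short token strings. The following is a minimal sketch of edit-distance search over such strings. It uses the standard Levenshtein distance with a brute-force scan; the archive contents, token IDs, and function names are hypothetical, and the paper's actual search procedure may differ.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences (unit costs)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def search(query, archive):
    """Return the archive key whose token string is closest to the query."""
    return min(archive, key=lambda name: edit_distance(query, archive[name]))


# Hypothetical archive of tokenized utterances (token IDs are made up).
archive = {
    "utt_a": [5, 9, 2, 7],
    "utt_b": [1, 1, 4],
}
print(search([5, 9, 7], archive))
```

Because the cost of each comparison grows with the product of the two string lengths, a tokenizer that emits fewer tokens per utterance shrinks both the archive and the work done per comparison.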
- PairAlign reduces audio archive tokens by 55% compared to standard VQ tokenization
- Operates at 12.71 tokens per second on 3-second speech retrieval tasks
- Uses cross-paired teacher forcing and anti-bypass regularization to learn variable-length token strings
Why It Matters
Enables efficient audio search and analysis over archives with 55% fewer tokens while preserving edit-distance search performance.