Audio & Speech

LoSATok compresses audio features 10x for better AI generation

New tokenizer shrinks 1280-dim audio to 128 dims without losing quality...

Deep Dive

Audio tokenizers are crucial for unifying understanding and generation, but existing tokenizers pack both semantics and acoustic detail into high-dimensional continuous latents, which burdens generation models such as Diffusion Transformers (DiTs). Researchers from multiple institutions introduce LoSATok (Low-dimensional Semantic-Acoustic Tokenizer), which compresses semantic features 10x, from 1280 dimensions to 128, while preserving both semantic meaning and acoustic richness. The key innovation is a Semantic Bottleneck regularized by a time-relation loss that keeps features temporally consistent. It is paired with dual-level semantic supervision that leverages both high- and low-dimensional signals.
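To make the bottleneck idea concrete, here is a minimal sketch in plain Python. It is not LoSATok's implementation. The summary does not give the formula for the time-relation loss, so the version below assumes one plausible form: the mean squared gap between the frame-to-frame cosine-similarity matrices of the 1280-dim and 128-dim sequences. The random linear projection, the function names, and the frame count are all hypothetical and chosen only for illustration.

```python
import math
import random

HIGH_DIM = 1280  # semantic feature size reported in the summary
LOW_DIM = 128    # bottleneck size reported in the summary


def project(frames, weights):
    """Linear bottleneck: map each frame to len(weights) dimensions."""
    return [
        [sum(w * x for w, x in zip(row, frame)) for row in weights]
        for frame in frames
    ]


def cosine_similarity_matrix(frames):
    """Return the T x T matrix of cosine similarities between frames."""
    # Guard against all-zero frames so we never divide by zero.
    norms = [math.sqrt(sum(v * v for v in f)) or 1.0 for f in frames]
    return [
        [sum(a * b for a, b in zip(fi, fj)) / (ni * nj)
         for fj, nj in zip(frames, norms)]
        for fi, ni in zip(frames, norms)
    ]


def time_relation_loss(high_frames, low_frames):
    """Assumed form: mean squared gap between the two similarity matrices.

    The loss is zero when the low-dim sequence relates its frames to each
    other exactly as the high-dim sequence does.
    """
    sim_high = cosine_similarity_matrix(high_frames)
    sim_low = cosine_similarity_matrix(low_frames)
    t = len(high_frames)
    return sum(
        (sim_high[i][j] - sim_low[i][j]) ** 2
        for i in range(t)
        for j in range(t)
    ) / (t * t)


# Toy data: a handful of random "semantic" frames and an untrained projection.
rng = random.Random(0)
num_frames = 6  # hypothetical clip length
high = [[rng.gauss(0, 1) for _ in range(HIGH_DIM)] for _ in range(num_frames)]
weights = [
    [rng.gauss(0, 1) / math.sqrt(HIGH_DIM) for _ in range(HIGH_DIM)]
    for _ in range(LOW_DIM)
]
low = project(high, weights)
loss = time_relation_loss(high, low)
print(f"{len(high[0])} -> {len(low[0])} dims, time-relation loss = {loss:.6f}")
```

In a real tokenizer the projection would be learned, and a term like this would be minimized alongside the semantic supervision so that the 128-dim latent keeps the temporal structure of the 1280-dim features.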

Experiments across speech, music, and general audio demonstrate that LoSATok's low-dimensional representations retain strong semantic capacity comparable to state-of-the-art semantic models, while consistently improving DiT generation quality across all three domains. This suggests that highly compressed latent spaces can effectively balance understanding and generation, reducing the modeling burden on diffusion architectures. The project is open-source, offering a practical path for efficient, cross-domain audio AI systems.

Key Points
  • Compresses audio semantic features from 1280 to 128 dimensions using a Semantic Bottleneck and time-relation loss.
  • Dual-level semantic supervision draws on both high- and low-dimensional signals, so the compact latent keeps semantic meaning alongside acoustic detail.
  • Improves DiT generation quality on speech, music, and general audio while retaining semantic capacity comparable to state-of-the-art semantic models.

Why It Matters

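A 10x smaller latent space reduces the modeling burden on diffusion-based generators while improving generation quality across speech, music, and general audio, and it does so without giving up semantic capacity.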