Scaling Open Discrete Audio Foundation Models with Interleaved Semantic, Acoustic, and Text Tokens
The new model family uses interleaved tokens to handle both semantic meaning and acoustic detail in audio.
Potsawee Manakul and colleagues present SODA (Scaling Open Discrete Audio), a suite of native audio foundation models ranging from 135M to 4B parameters and trained on 500B tokens. The models use a novel interleaved token architecture to jointly model semantic content, acoustic details, and text. This unified approach lets a single model backbone handle diverse tasks, such as voice-preserving speech-to-speech translation, moving beyond text-first audio AI.
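To make the idea of interleaving concrete, the sketch below shows one plausible way semantic, acoustic, and text tokens could be merged into a single stream for a decoder-only backbone. The vocabulary offsets, helper name, and per-frame pairing order are illustrative assumptions, not the exact scheme described in the paper.

```python
# Minimal sketch: interleave semantic, acoustic, and text tokens into one
# sequence for a decoder-only language model. Offsets and the frame-level
# ordering are hypothetical, not SODA's actual layout.

from typing import List

# Assumed vocabulary layout: disjoint ID ranges per modality.
TEXT_OFFSET = 0           # e.g. 0 .. 31_999
SEMANTIC_OFFSET = 32_000  # e.g. 32_000 .. 36_095
ACOUSTIC_OFFSET = 36_096  # e.g. 36_096 onwards


def interleave_frames(
    text_ids: List[int],
    semantic_ids: List[int],
    acoustic_ids: List[int],
) -> List[int]:
    """Build one token stream: text prefix, then per-frame
    (semantic, acoustic) pairs, so a single backbone attends to
    meaning and fine acoustic detail jointly."""
    # Simplifying assumption: one acoustic token per semantic frame.
    assert len(semantic_ids) == len(acoustic_ids)
    sequence = [TEXT_OFFSET + t for t in text_ids]
    for sem, aco in zip(semantic_ids, acoustic_ids):
        sequence.append(SEMANTIC_OFFSET + sem)
        sequence.append(ACOUSTIC_OFFSET + aco)
    return sequence


# Example: a short text prompt followed by three audio frames.
tokens = interleave_frames(
    text_ids=[101, 57, 902],
    semantic_ids=[12, 848, 3],
    acoustic_ids=[4096, 17, 2500],
)
print(tokens)
```

Keeping the modalities in one flat sequence is what allows a single autoregressive model to condition on, and generate, any mix of text and audio tokens.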
Why It Matters
Enables more natural, fine-grained AI audio generation and editing while preserving speaker identity, advancing beyond simple text-to-speech.