SODA AI model trains on 500B tokens for unified audio/text generation
Researchers' new model uses interleaved tokens to handle both semantic meaning and acoustic details in audio.
Potsawee Manakul et al. present SODA (Scaling Open Discrete Audio), a suite of native audio foundation models ranging from 135M to 4B parameters. Trained on 500B tokens, the models use a novel interleaved token architecture to jointly model semantic content, acoustic details, and text. This unified approach lets a single model backbone handle diverse tasks, such as voice-preserving speech-to-speech translation, moving beyond text-first audio AI.
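To make the idea of interleaving concrete, here is a minimal sketch of how text, semantic, and acoustic tokens might be flattened into one sequence for a single backbone. The summary above does not describe SODA's actual token layout, so everything here is an assumption for illustration: the function name `interleave_streams`, the stream tags, the text-first ordering, and the fixed number of acoustic tokens per frame are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical tagging scheme: (stream name, token id).
Token = Tuple[str, int]


def interleave_streams(
    text_ids: List[int],
    semantic_ids: List[int],
    acoustic_ids: List[int],
    acoustic_per_frame: int = 2,
) -> List[Token]:
    """Flatten three token streams into one sequence.

    Assumed layout (not taken from the paper): all text tokens first,
    then, for each audio frame, one semantic token followed by that
    frame's acoustic tokens.
    """
    if len(acoustic_ids) != len(semantic_ids) * acoustic_per_frame:
        raise ValueError("acoustic stream length does not match frame count")

    sequence: List[Token] = [("text", t) for t in text_ids]
    for frame, sem in enumerate(semantic_ids):
        sequence.append(("semantic", sem))
        start = frame * acoustic_per_frame
        for ac in acoustic_ids[start:start + acoustic_per_frame]:
            sequence.append(("acoustic", ac))
    return sequence


# Two text tokens, two audio frames, two acoustic tokens per frame.
example = interleave_streams([7, 8], [101, 102], [1, 2, 3, 4])
```

Because every stream ends up in one flat sequence, an ordinary next-token model can in principle be trained on it without separate heads per modality; how SODA itself orders or weights the streams is not stated in this summary.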
Why It Matters
The unified approach enables more natural, detailed AI audio generation and editing while preserving speaker identity, advancing beyond simple text-to-speech.