Scaling Open Discrete Audio Foundation Models with Interleaved Semantic, Acoustic, and Text Tokens
The new model family uses interleaved tokens to handle both semantic meaning and acoustic detail in audio.
Potsawee Manakul and colleagues present SODA (Scaling Open Discrete Audio), a suite of native audio foundation models ranging from 135M to 4B parameters and trained on 500B tokens. The models use a novel interleaved token architecture to jointly model semantic content, acoustic details, and text. This unified approach lets a single model backbone handle diverse tasks, such as voice-preserving speech-to-speech translation, moving beyond text-first audio AI.
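To make the idea of interleaving concrete, the sketch below shows one plausible way semantic, acoustic, and text tokens could be merged into a single stream for a decoder-only backbone. The vocabulary offsets, helper name, and per-frame pairing order are illustrative assumptions, not the exact scheme described in the paper.

```python
# Minimal sketch: interleave semantic, acoustic, and text tokens into one
# sequence for a decoder-only language model. Offsets and the frame-level
# ordering are hypothetical, not SODA's actual layout.

from typing import List

# Assumed vocabulary layout: disjoint ID ranges per modality.
TEXT_OFFSET = 0           # e.g. 0 .. 31_999
SEMANTIC_OFFSET = 32_000  # e.g. 32_000 .. 36_095
ACOUSTIC_OFFSET = 36_096  # e.g. 36_096 onwards


def interleave_frames(
    text_ids: List[int],
    semantic_ids: List[int],
    acoustic_ids: List[int],
) -> List[int]:
    """Build one token stream: text prefix, then per-frame
    (semantic, acoustic) pairs, so a single backbone attends to
    meaning and fine acoustic detail jointly."""
    # Simplifying assumption: one acoustic token per semantic frame.
    assert len(semantic_ids) == len(acoustic_ids)
    sequence = [TEXT_OFFSET + t for t in text_ids]
    for sem, aco in zip(semantic_ids, acoustic_ids):
        sequence.append(SEMANTIC_OFFSET + sem)
        sequence.append(ACOUSTIC_OFFSET + aco)
    return sequence


# Example: a short text prompt followed by three audio frames.
tokens = interleave_frames(
    text_ids=[101, 57, 902],
    semantic_ids=[12, 848, 3],
    acoustic_ids=[4096, 17, 2500],
)
print(tokens)
```

Keeping the modalities in one flat sequence is what allows a single autoregressive model to condition on, and generate, any mix of text and audio tokens.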
Why It Matters
Enables more natural, fine-grained AI audio generation and editing while preserving speaker identity, advancing beyond simple text-to-speech.