Audio & Speech

CodecSep separates any sound with text prompts, using 54x less compute than AudioSep

arXiv eess.AS June 26, 2026

⚡Extract any audio source directly in codec space with open-vocabulary prompts and minimal compute.

Deep Dive

A new paper by Adhiraj Banerjee and Vipul Arora presents CodecSep, a prompt-driven universal sound separation framework that operates directly in the latent space of a neural audio codec. Unlike systems such as AudioSep, which need decoded audio as input, CodecSep pairs a frozen DAC (Descript Audio Codec) backbone with a lightweight FiLM-conditioned Transformer masker driven by CLAP text embeddings. This design allows open-vocabulary separation: users describe any sound in text and extract it, while keeping the efficiency of a codec-native pipeline. The model separates sources through explicit latent masking rather than decoder-style generation, which the authors report is substantially more effective in codec space.
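To make the idea of FiLM-conditioned latent masking concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: the Transformer is omitted, the weights are random, and every name (`separate_latents`, `toy_params`, the parameter keys) is hypothetical. It only illustrates the general pattern of deriving a per-channel scale and shift from a text embedding, applying them to each latent frame, and using the result to gate the mixture latents.

```python
import math
import random


def linear(vec, weights, bias):
    """Dense layer: `weights` holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def separate_latents(latents, text_emb, params):
    """Gate each latent frame with a mask conditioned on the text embedding.

    latents:  list of frames, each a list of `latent_dim` floats
    text_emb: list of `text_dim` floats (stand-in for a CLAP embedding)
    """
    # FiLM: the prompt embedding is projected to a scale (gamma) and shift (beta).
    gamma = linear(text_emb, params["w_gamma"], params["b_gamma"])
    beta = linear(text_emb, params["w_beta"], params["b_beta"])
    separated = []
    for frame in latents:
        # Feature-wise modulation; a real masker would run Transformer layers here.
        hidden = [g * x + b for g, x, b in zip(gamma, frame, beta)]
        # Explicit masking: values in (0, 1) scale the mixture latents.
        mask = [sigmoid(h) for h in hidden]
        separated.append([m * x for m, x in zip(mask, frame)])
    return separated


def toy_params(latent_dim, text_dim, seed=0):
    """Random projection weights, purely for illustration."""
    rng = random.Random(seed)

    def matrix():
        return [[rng.uniform(-1.0, 1.0) for _ in range(text_dim)]
                for _ in range(latent_dim)]

    return {
        "w_gamma": matrix(), "b_gamma": [1.0] * latent_dim,
        "w_beta": matrix(), "b_beta": [0.0] * latent_dim,
    }
```

Because the mask only scales the existing latents, the output stays in the same space the frozen codec decoder expects, which is the practical appeal of masking over generating new latents.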

On benchmarks including dnr-v2 and five open-domain datasets, CodecSep consistently improves over AudioSep in SI-SDR (scale-invariant signal-to-distortion ratio), achieves competitive ViSQOL scores, and shows clear gains in human MOS-LQS ratings. The standout metric is compute: CodecSep requires only 1.35 GMACs end-to-end, roughly 54x less than AudioSep in the same pipeline and 25x less for the separator alone, with significantly lower latency and memory use. It also offers a practical deployment path for code-stream audio: when audio arrives as neural codec codes, CodecSep maps them to embeddings, separates in codec space, and outputs waveforms or re-quantized codes, avoiding the decode, separate, re-encode loop. This makes CodecSep a blueprint for efficient, codec-native downstream audio processing.
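For readers unfamiliar with SI-SDR, the sketch below computes it using the standard definition: the estimate is projected onto the reference, and the energy of that projection is compared with the energy of the residual, in decibels. The function name and the `eps` stabilizer are illustrative choices, not taken from the paper's evaluation code.

```python
import math


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant signal-to-distortion ratio in dB (higher is better)."""
    if len(estimate) != len(reference):
        raise ValueError("estimate and reference must have the same length")
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    # Optimal scaling of the reference removes any dependence on output gain.
    alpha = dot / (ref_energy + eps)
    target = [alpha * r for r in reference]
    residual = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual)
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))
```

Rescaling the estimate by a constant leaves the score essentially unchanged, so a separator cannot improve its SI-SDR simply by producing louder or quieter output.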

Key Points

Open-vocabulary separation via CLAP text embeddings enables users to describe any sound source for extraction
Runs at 1.35 GMACs end-to-end—54x less compute than AudioSep, enabling low-latency edge deployment
Outperforms AudioSep on SI-SDR and MOS-LQS across multiple benchmarks, including dnr-v2 and open-domain datasets

Why It Matters

CodecSep makes real-time, open-ended sound separation practical on edge devices with a 54x compute reduction.
