Audio & Speech

Direct Simultaneous Translation Activation for Large Audio-Language Models

No architecture changes needed: 1% simulated data unlocks simultaneous translation

Deep Dive

Simultaneous speech-to-text translation (Simul-S2TT) traditionally requires custom model architectures or decoding strategies to output translations while still receiving input. However, with the rise of large audio-language models (LALMs), a key question is whether these capabilities can be directly activated without architectural changes. In a new paper accepted at ICASSP 2026, researchers from multiple institutions introduce SimulSA (Simultaneous Self-Augmentation), a strategy that leverages LALMs' inherent abilities to generate simultaneous data by randomly truncating speech and constructing partially aligned translations. By adding just 1% of this simulated simultaneous data to the full offline supervised fine-tuning (SFT) dataset, the method effectively bridges the distribution gap between offline training and real-time inference.
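The core idea, truncating speech at a random point and keeping only the translation prefix aligned with the retained audio, can be sketched in a few lines. The function names and the proportional prefix alignment below are illustrative assumptions, not the paper's exact procedure:

```python
import random

def simulate_simul_pair(audio_frames, translation_tokens, rng=random):
    """Build one simulated simultaneous example from an offline pair.

    Hypothetical sketch of SimulSA-style self-augmentation: cut the
    speech at a random point, then keep the translation prefix assumed
    (here, proportionally) to align with the kept audio. The paper's
    actual alignment method may differ.
    """
    # Pick a truncation point strictly inside the utterance.
    cut = rng.randint(1, len(audio_frames) - 1)
    frac = cut / len(audio_frames)
    # Keep the (assumed) proportionally aligned translation prefix.
    n_tokens = max(1, round(frac * len(translation_tokens)))
    return audio_frames[:cut], translation_tokens[:n_tokens]

def augment_dataset(offline_pairs, ratio=0.01, rng=random):
    """Mix ~`ratio` simulated simultaneous pairs into the offline SFT set."""
    n_sim = max(1, int(ratio * len(offline_pairs)))
    chosen = rng.sample(offline_pairs, n_sim)
    simulated = [simulate_simul_pair(a, t, rng) for a, t in chosen]
    return offline_pairs + simulated
```

With `ratio=0.01`, a 100k-pair offline corpus would gain roughly 1,000 truncated pairs, matching the paper's "1% simulated data" recipe.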

The results are striking: SimulSA activates robust Simul-S2TT capabilities without any modification to the model architecture or decoding strategy, so existing LALMs can gain real-time translation abilities with only a tiny fraction of specialized data. The approach is lightweight (no additional training loops or custom layers are needed) and practical for deployment in live interpretation systems, voice assistants, and multilingual communication tools. By demonstrating that just 1% simulated simultaneous data suffices, the work opens the door to rapid adaptation of foundation audio models to streaming tasks. The paper is available on arXiv.

Key Points
  • Adding just 1% simulated simultaneous data to the full offline SFT dataset is enough to activate real-time translation capabilities.
  • No changes to model architecture or decoding strategy are required—works with existing large audio-language models.
  • Accepted at ICASSP 2026, a flagship conference for audio and speech processing.

Why It Matters

Enables real-time translation without costly model redesign, making live interpretation practical for current LALMs.