Research & Papers

TASTE-Streaming: Towards Streamable Text-Aligned Speech Tokenization and Embedding for Spoken Language Modeling

New method slashes latency by integrating streaming ASR, enabling live spoken conversations with AI.

Deep Dive

Researchers Liang-Hsuan Tseng and Hung-yi Lee have introduced TASTE-S (Streamable Text-Aligned Speech Tokenization and Embedding), a significant upgrade designed for real-time spoken language modeling. The core innovation tackles a major bottleneck: traditional systems like their prior TASTE model required a complete audio clip and an external Automatic Speech Recognition (ASR) system to convert speech into a sequence of tokens aligned with text. This non-causal process created high latency, making live interaction impossible. TASTE-S solves this by embedding a streaming, CTC-based ASR module directly into the neural encoder, allowing for instant, dual-modality encoding of speech as it's heard.
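The streaming behavior described above hinges on CTC decoding that can emit tokens chunk by chunk instead of waiting for the whole clip. The snippet below is a minimal generic sketch of that idea, not the paper's actual module: a greedy CTC collapse whose only carried state is the previous frame's token, so repeats that straddle a chunk boundary still collapse correctly.

```python
import numpy as np

BLANK = 0  # index of the CTC blank symbol (assumed convention)

def ctc_greedy_stream(logits_chunk, prev_token=BLANK):
    """Greedy CTC collapse over one chunk of frame-level logits.

    Carrying `prev_token` across chunks lets tokens be emitted as
    audio arrives, rather than after the full utterance is seen.
    """
    emitted = []
    for frame in logits_chunk:            # frame: (vocab,) scores
        tok = int(np.argmax(frame))
        if tok != BLANK and tok != prev_token:
            emitted.append(tok)
        prev_token = tok
    return emitted, prev_token

# Two chunks of a simulated utterance; the argmax path is 1 1 0 2 | 2 3 0.
chunk1 = np.eye(4)[[1, 1, 0, 2]]
chunk2 = np.eye(4)[[2, 3, 0]]

out1, state = ctc_greedy_stream(chunk1)       # -> [1, 2]
out2, _ = ctc_greedy_stream(chunk2, state)    # -> [3]; the repeated 2 collapses
print(out1 + out2)                            # [1, 2, 3]
```

Because each chunk needs only the single-token state from the previous chunk, latency is bounded by the chunk length rather than the utterance length, which is the property the article attributes to the embedded streaming ASR.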

By redesigning the unit decoder to be causal, meaning it uses only past and present information, TASTE-S can decode speech tokens on the fly. The authors show through joint training that this streamable architecture matches the performance of the original batch-processing TASTE model while dramatically reducing latency. This breakthrough is crucial for developing AI agents that can engage in natural, flowing dialogue. Furthermore, the paper notes the system's robustness to imperfect transcriptions and its capability for long-form encoding and decoding, paving the way for applications such as real-time voice assistants, live translation, and interactive tutoring systems that respond without perceptible delay.

Key Points
  • Integrates a streaming CTC-ASR module into the encoder, eliminating dependency on slow external systems.
  • Uses a causal decoder for on-the-fly token generation, enabling real-time processing of speech.
  • Matches the performance of the non-streaming TASTE model while enabling live, long-form conversations.

Why It Matters

This is a key step towards AI that can converse naturally in real-time, powering next-gen voice assistants and interactive agents.