Distilling Conversations: Abstract Compression of Conversational Audio Context for LLM-based ASR
New method replaces raw audio with learned tokens, cutting computational cost while preserving contextual accuracy.
A research team including Shashi Kumar, Esaú Villatoro-Tello, and eight others has published a paper introducing 'Abstract Compression,' a novel method for improving Large Language Model-based Automatic Speech Recognition (LLM-based ASR). Current systems typically process speech utterances in isolation, missing valuable conversational context that could improve accuracy, especially for contextual entities like names and specific terms. Their research confirms that using multi-turn conversational context does help recognition, but conditioning models on raw prior audio is computationally expensive because the audio token sequence grows rapidly with conversation length.
To solve this, the team proposes replacing the audio portion of previous conversation turns with a fixed number of learned latent tokens, while explicitly retaining the corresponding text transcripts. This compressed representation significantly reduces the model's 'prior-turn audio footprint.' Testing on both in-domain and out-of-domain datasets showed that models using Abstract Compression recovered a substantial portion of the accuracy gains achieved by models using full raw context, but at a fraction of the computational cost. The paper includes detailed analyses of the compression setup and its performance trade-offs, demonstrating a practical path toward more efficient, context-aware speech recognition systems.
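The core idea above can be sketched in a few lines: a small, fixed set of learned latent queries cross-attends over a prior turn's audio frames, so every turn is summarized into the same number of tokens no matter how long it was. This is a minimal illustrative sketch, not the authors' implementation; the attention-pooling scheme, dimensions, and function names here are assumptions for illustration (real systems would learn the latents jointly with the ASR model).

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def compress_audio(frames, latents):
    """Cross-attend K learned latent queries over T audio frames.

    frames:  list of T frame embeddings, each a list of d floats
             (T grows with turn length).
    latents: list of K learned query vectors, each d floats
             (K is fixed, e.g. K << T).
    Returns exactly K compressed tokens regardless of T, so the
    prior-turn audio footprint no longer grows with the turn.
    """
    d = len(latents[0])
    compressed = []
    for q in latents:
        # Scaled dot-product attention of one latent query over all frames.
        scores = [sum(qi * fi for qi, fi in zip(q, f)) / math.sqrt(d)
                  for f in frames]
        weights = softmax(scores)
        # Weighted sum of frames -> one compressed token.
        compressed.append([sum(w * f[j] for w, f in zip(weights, frames))
                           for j in range(d)])
    return compressed

random.seed(0)
d, K = 8, 4  # assumed embedding size and latent count
latents = [[random.gauss(0, 1) for _ in range(d)] for _ in range(K)]

short_turn = [[random.gauss(0, 1) for _ in range(d)] for _ in range(50)]
long_turn = [[random.gauss(0, 1) for _ in range(d)] for _ in range(500)]

# Both turns shrink to the same fixed footprint of K tokens.
print(len(compress_audio(short_turn, latents)))  # 4
print(len(compress_audio(long_turn, latents)))   # 4
```

Under this kind of scheme, the LLM's context for earlier turns is the K compressed tokens plus the explicit text transcript, rather than hundreds of raw audio tokens per turn.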
- Replaces raw prior audio with a fixed number of learned latent tokens, slashing computational load
- Retains explicit text transcripts to preserve key contextual information
- Recovers most accuracy gains for contextual entities like names without full audio cost
Why It Matters
Enables scalable, accurate transcription of meetings and calls by making conversational context affordable for LLMs.