Audio & Speech

Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs

A new paper introduces a multi-stage training strategy for LLM-based speech-to-text systems, achieving competitive results with a model of just 2.3B parameters.

Deep Dive

A team of researchers has published a new paper on arXiv titled 'Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs'. The work addresses core challenges in modern Automatic Speech Recognition (ASR) systems that integrate Large Language Models (LLMs). While these LLM-based ASR models show promise, they struggle with balancing recognition quality, latency, and computational overhead. A significant roadblock to real-world deployment is the problem of 'hallucinations,' where the model generates plausible but incorrect text.

The researchers tackle this by revisiting the system design from an 'entropy allocation' perspective. They introduce three new metrics to analyze how different training methods distribute the task of reducing uncertainty (entropy) between the speech encoder, which processes the audio, and the LLM, which generates the text, and they find that prevailing methods allocate this work inefficiently. To fix this, they propose a principled, multi-stage training strategy designed with 'capability-boundary awareness.' The strategy includes a redesigned pre-training phase that bridges the gap between the speech and text modalities, plus a novel 'iterative asynchronous Supervised Fine-Tuning (SFT)' stage placed between alignment and joint training. This stage preserves a functional decoupling between the encoder and the LLM, which constrains representation drift in the encoder, a key contributor to hallucinations.
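The paper's actual metrics are not reproduced here, but the idea of 'entropy allocation' can be illustrated with a small sketch: treat each component as reducing the uncertainty over candidate output tokens, and measure what fraction of the total entropy reduction each one contributes. The distributions below are invented for illustration only.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical distributions over 4 candidate tokens at two points in the
# pipeline: after the speech encoder's output, and after the LLM's decoding.
prior = [0.25, 0.25, 0.25, 0.25]          # maximal uncertainty
after_encoder = [0.55, 0.25, 0.15, 0.05]  # encoder narrows the candidates
after_llm = [0.90, 0.06, 0.03, 0.01]      # LLM resolves most remaining doubt

h0, h1, h2 = entropy(prior), entropy(after_encoder), entropy(after_llm)

# Share of the total uncertainty reduction attributable to each component.
encoder_share = (h0 - h1) / (h0 - h2)
llm_share = (h1 - h2) / (h0 - h2)
print(f"encoder share: {encoder_share:.2f}, LLM share: {llm_share:.2f}")
```

An imbalanced split in a measure like this, where one component absorbs nearly all of the uncertainty reduction, is the kind of inefficiency the paper's allocation analysis is designed to surface.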

Experiments on standard Mandarin and English benchmarks demonstrate the effectiveness of the approach. The method achieves competitive performance with a far more efficient model of only 2.3 billion parameters. Crucially, the decoupling-oriented design of the training pipeline proves highly effective at mitigating hallucinations, making such systems more reliable for practical applications.

Key Points
  • Proposes a new 'entropy allocation' framework with three metrics to analyze efficiency in LLM-based speech recognition systems.
  • Introduces a multi-stage training strategy with a novel 'iterative asynchronous SFT' stage to preserve component decoupling and reduce hallucinations.
  • Achieves competitive benchmark performance using a highly efficient 2.3B parameter model, addressing key deployment challenges of size and reliability.
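The structure of the 'iterative asynchronous SFT' stage described above can be sketched in toy form. This is not the paper's code: it assumes the stage works by alternately fine-tuning one component while the other is frozen, so the encoder and LLM stay functionally decoupled until joint training begins.

```python
# Illustrative sketch of alternating (asynchronous) fine-tuning phases.
class Component:
    """Stand-in for a trainable model component (encoder or LLM)."""
    def __init__(self, name):
        self.name = name
        self.trainable = True
        self.updates = 0

    def freeze(self):
        self.trainable = False

    def unfreeze(self):
        self.trainable = True

    def train_step(self):
        # Gradient updates only touch the unfrozen component.
        if self.trainable:
            self.updates += 1

def iterative_async_sft(encoder, llm, rounds=3, steps_per_round=2):
    """Alternate SFT: tune the LLM with the encoder frozen, then swap."""
    for _ in range(rounds):
        encoder.freeze(); llm.unfreeze()
        for _ in range(steps_per_round):
            encoder.train_step(); llm.train_step()
        llm.freeze(); encoder.unfreeze()
        for _ in range(steps_per_round):
            encoder.train_step(); llm.train_step()
    # Both components are released before the subsequent joint-training stage.
    encoder.unfreeze(); llm.unfreeze()

enc, lm = Component("speech_encoder"), Component("llm")
iterative_async_sft(enc, lm)
print(enc.updates, lm.updates)
```

The point of the alternation is that the encoder's representations are never pulled around by simultaneous LLM gradients during SFT, which is the drift the paper identifies as a source of hallucinations.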

Why It Matters

This research could lead to more accurate, efficient, and reliable voice AI for assistants, transcription, and real-time translation tools.