NIM4-ASR: A 2.3B-parameter LLM-based speech model beats giants with real-time RAG

arXiv eess.AS April 21, 2026

⚡ New framework achieves SOTA with 2.3B parameters and supports million-scale hotword customization via sub-millisecond RAG.

Deep Dive

A research team of 12 authors, led by Yuan Xie, has published a paper detailing NIM4-ASR, a new framework designed to make Large Language Model-based Automatic Speech Recognition (LLM-based ASR) practical for real-world deployment. The work directly tackles the major shortcomings of current data-driven models: poor performance in resource-limited environments and a tendency to 'hallucinate' incorrect text under challenging acoustic conditions like background noise. NIM4-ASR's core innovation is a principled redesign of the training pipeline, clearly separating the roles of the audio encoder and the LLM to improve efficiency and accuracy.

The framework introduces a three-stage training process: a reformulated pre-training stage to bridge the gap between the audio and text modalities, an iterative fine-tuning stage to preserve acoustic detail, and a specialized reinforcement learning stage to boost robustness. The authors report that this allows the compact 2.3B-parameter model to outperform larger competitors, especially in entity-intensive scenarios such as transcribing names and technical terms. For production use, the framework includes real-time streaming inference and a retrieval-augmented generation (RAG) system that supports million-scale custom 'hotwords' (such as product names or user-specific vocabulary) with retrieval latency under a millisecond, enabling dynamic personalization without retraining the core model.
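To make the retrieve-then-inject idea behind hotword customization concrete, here is a minimal sketch in Python. It is not the method from the paper, which this summary does not describe: the `HotwordIndex` class, the character-trigram scoring, the `min_score` threshold, and the prompt format in `build_prompt` are all assumptions chosen for illustration, and this toy index makes no claim about sub-millisecond latency at million scale.

```python
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


class HotwordIndex:
    """Toy inverted index from character trigrams to hotwords (hypothetical)."""

    def __init__(self, hotwords):
        self.hotwords = list(hotwords)
        self.grams = [char_ngrams(word) for word in self.hotwords]
        self.postings = defaultdict(list)  # trigram -> ids of hotwords containing it
        for idx, grams in enumerate(self.grams):
            for gram in grams:
                self.postings[gram].append(idx)

    def retrieve(self, draft, top_k=3, min_score=0.5):
        """Return hotwords whose trigrams are mostly covered by the draft text."""
        hits = Counter()
        for gram in char_ngrams(draft):
            for idx in self.postings.get(gram, ()):
                hits[idx] += 1
        # Score = fraction of the hotword's trigrams that appear in the draft.
        scored = [(count / len(self.grams[idx]), self.hotwords[idx])
                  for idx, count in hits.items()]
        kept = [item for item in scored if item[0] >= min_score]
        kept.sort(key=lambda item: (-item[0], item[1]))
        return [word for _, word in kept[:top_k]]


def build_prompt(draft, hotwords):
    """Assumed prompt layout: retrieved hotwords are passed as hints to the LLM."""
    return f"Candidate hotwords: {', '.join(hotwords)}\nDraft transcript: {draft}"


if __name__ == "__main__":
    index = HotwordIndex(["NIM4-ASR", "Yuan Xie", "Kubernetes"])
    draft = "the paper by yuan xie describes nim4 asr"
    print(build_prompt(draft, index.retrieve(draft)))
```

The point of the design is that the hotword list lives outside the model: adding a new product name only updates the index, which is what allows personalization without retraining. A production system would need a much faster and more robust retriever than this trigram lookup.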

Key Points

Achieves state-of-the-art benchmark performance with a highly efficient 2.3 billion parameter model, outperforming larger competitors.
Introduces a production-ready RAG system for million-scale hotword customization with sub-millisecond retrieval latency for real-time use.
Redesigned three-stage training pipeline specifically targets robustness against noise, silence, and hallucination in challenging acoustic conditions.

Why It Matters

Enables accurate, customizable speech recognition for real-time applications like call centers and assistants, even on limited hardware.
