StepAudio 2.5 unifies ASR, TTS, and realtime dialogue in one model
Single audio-language model matches specialized systems across speech recognition, synthesis, and live interaction.
StepAudio 2.5, detailed in a new technical report on arXiv, is a unified audio-language foundation model that matches or outperforms specialized systems across automatic speech recognition (ASR), text-to-speech (TTS), and realtime spoken interaction. Rather than treating these tasks as architecturally separate, the model rests on the premise that once text and audio share a multimodal representational space, task specialization reduces to a choice of operational regime: data construction, optimization targets, and decoding constraints. The key innovation is moving post-training beyond standard supervised learning to task-tailored Reinforcement Learning from Human Feedback (RLHF), which serves as the primary mechanism for defining complex optimization targets.
StepAudio 2.5 combines this RLHF-centric alignment with specialized decoding to shape its shared backbone into three distinct operational modes. The ASR branch improves transcription efficiency through verifiable multi-token decoding. The TTS branch achieves controllable, expressive synthesis via preference-based RLHF and context-rich supervision. The Realtime branch enables low-latency, persona-consistent dialogue using generative reward modeling within an RLHF framework. According to the report, StepAudio 2.5 achieves state-of-the-art results on standard benchmarks across all three capabilities, demonstrating that a single foundation model can internalize the distinct deployment objectives of speech understanding, generation, and live interaction.
- StepAudio 2.5 achieves state-of-the-art results on ASR, TTS, and realtime spoken interaction benchmarks with a single unified model.
- Its post-training moves beyond standard supervised learning to task-tailored RLHF alignment, which defines the optimization target for each operational branch.
- The three branches rely on verifiable multi-token decoding for efficient ASR, preference-based RLHF for expressive TTS, and generative reward modeling for realtime dialogue.
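The summary above does not say how StepAudio 2.5 implements verifiable multi-token decoding, so the following is only a generic draft-then-verify sketch of the idea, not the report's method. It assumes a cheap proposer that guesses several tokens at once and a trusted verifier that keeps the longest agreeing prefix and corrects the first mismatch. All names here (`verify_draft`, `decode`, `propose`, `verifier_next`, the toy `TARGET` and `NOISY` sequences) are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Token = int


def verify_draft(
    draft: Sequence[Token],
    verifier_next: Callable[[List[Token]], Token],
    prefix: Sequence[Token],
) -> List[Token]:
    """Keep the longest draft prefix the verifier agrees with.

    On the first mismatch, the verifier's own token replaces the bad draft
    token and the rest of the draft is discarded.
    """
    accepted: List[Token] = []
    context = list(prefix)
    for tok in draft:
        expected = verifier_next(context)
        if expected != tok:
            accepted.append(expected)  # correction from the verifier
            return accepted
        accepted.append(tok)
        context.append(tok)
    return accepted


def decode(
    propose: Callable[[List[Token]], List[Token]],
    verifier_next: Callable[[List[Token]], Token],
    eos: Token,
    max_len: int = 64,
) -> Tuple[List[Token], int]:
    """Decode until EOS or max_len; return the tokens and the number of rounds."""
    out: List[Token] = []
    rounds = 0
    while len(out) < max_len and (not out or out[-1] != eos):
        new = verify_draft(propose(out), verifier_next, out)
        if not new:  # empty draft: fall back to one verified token
            new = [verifier_next(out)]
        for tok in new:
            out.append(tok)
            if tok == eos or len(out) >= max_len:
                break
        rounds += 1
    return out, rounds


# Toy setup (assumption): the verifier knows the correct transcript,
# and the proposer drafts 3 tokens per round with one error at index 2.
EOS = 0
TARGET = [5, 6, 7, 8, 9, EOS]
NOISY = [5, 6, 1, 8, 9, EOS]


def toy_verifier(context: List[Token]) -> Token:
    return TARGET[len(context)] if len(context) < len(TARGET) else EOS


def toy_proposer(context: List[Token]) -> List[Token]:
    return NOISY[len(context):len(context) + 3]
```

In this toy, each round emits several tokens, so the six-token transcript needs fewer rounds than one-token-at-a-time decoding would. The verifier is called once per draft token here for clarity; a real model would typically score all draft positions in a single batched forward pass, which is where the efficiency gain comes from.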
Why It Matters
A single model that replaces three specialized systems could simplify deployment and lower costs for speech applications.