In-Sync: Adaptation of Speech Aware Large Language Models for ASR with Word Level Timestamp Predictions
New training methods boost timestamp accuracy and ASR quality simultaneously.
Researchers from IBM Research and the University of Illinois Urbana-Champaign have introduced In-Sync, a method that adapts speech-aware large language models to predict word-level timestamps directly alongside automatic speech recognition (ASR) transcripts. Accepted to ICASSP 2026, the work addresses a critical gap: while modern speech LLMs excel at transcription and richer forms of text generation, precise timestamp prediction (essential for captioning, media search, and multimodal synchronization) has typically required separate alignment tools. The team's lightweight training strategies enhance alignment robustness without degrading recognition quality, yielding gains in both timestamp accuracy and overall ASR performance across multiple benchmark datasets.
In-Sync extends an existing speech-aware LLM architecture by integrating timestamp prediction into the model's output, creating a unified system that eliminates reliance on external aligners. The training strategies are designed to be computationally efficient, requiring minimal additional overhead while delivering measurable improvements. Experiments show that the approach not only refines timestamp precision but also lowers word error rates, suggesting a synergistic effect between the alignment and transcription tasks. This integration simplifies pipelines for real-time applications like live captioning and video indexing, where synchronized text is critical. By demonstrating that timestamp prediction can be folded seamlessly into end-to-end speech LLMs, In-Sync paves the way for more robust and efficient multimodal AI systems.
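The article doesn't specify how In-Sync serializes timestamps into the model's output, but a common approach in end-to-end systems is to interleave quantized time tokens with the words of the transcript. Here is a minimal sketch under that assumption, using a hypothetical 20 ms frame grid and hypothetical `<N>` time tokens (none of these details come from the paper):

```python
import re

FRAME_SEC = 0.02  # hypothetical 20 ms quantization; the article does not
                  # specify In-Sync's actual timestamp tokenization

def words_to_target(words):
    """Serialize (word, start_s, end_s) triples into one training-target
    string with inline timestamp tokens, e.g. '<12> hello <26>'."""
    return " ".join(
        f"<{round(start / FRAME_SEC)}> {word} <{round(end / FRAME_SEC)}>"
        for word, start, end in words
    )

def target_to_words(text):
    """Parse a decoded target string back into (word, start_s, end_s)."""
    return [
        (word, int(start) * FRAME_SEC, int(end) * FRAME_SEC)
        for start, word, end in re.findall(r"<(\d+)>\s+(\S+)\s+<(\d+)>", text)
    ]

# Round trip; recovered times are quantized to the 20 ms frame grid.
target = words_to_target([("hello", 0.24, 0.52), ("world", 0.58, 1.02)])
print(target)                   # <12> hello <26> <29> world <51>
print(target_to_words(target))
```

Emitting timestamps as ordinary tokens in the output stream is what lets a single decoding pass produce both the transcript and its alignment, which is the unification the paper claims.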
- In-Sync adapts speech-aware LLMs for direct word-level timestamp prediction, removing the need for external alignment tools.
- Lightweight training strategies improve timestamp accuracy and reduce ASR word error rates across multiple datasets (one common accuracy metric is sketched after this list).
- Accepted to ICASSP 2026, the work targets captioning, media search, and multimodal synchronization applications.
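The article doesn't say how timestamp accuracy is scored. A common proxy is the mean absolute deviation of predicted word boundaries from forced-alignment references; the sketch below assumes that metric and is not the paper's actual evaluation protocol:

```python
def mean_abs_timestamp_error(ref, hyp):
    """Mean absolute deviation (seconds) between reference and predicted
    word boundaries. ref/hyp: lists of (word, start_s, end_s). Skips
    mismatched words; a full protocol would edit-distance-align first."""
    errors = []
    for (rw, r0, r1), (hw, h0, h1) in zip(ref, hyp):
        if rw != hw:
            continue
        errors.extend((abs(r0 - h0), abs(r1 - h1)))
    return sum(errors) / len(errors) if errors else float("nan")

ref = [("hello", 0.24, 0.52), ("world", 0.58, 1.02)]  # e.g. forced alignment
hyp = [("hello", 0.26, 0.50), ("world", 0.60, 1.00)]  # model predictions
print(mean_abs_timestamp_error(ref, hyp))  # ~0.02 s average boundary error
```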
Why It Matters
Unified speech LLMs that predict timestamps natively will streamline real-time captioning and media indexing workflows.