Audio & Speech

Listening with Time: Precise Temporal Awareness for Long-Form Audio Understanding

New model handles 30-minute audio with 1.2k hours of training data...

Deep Dive

A team of researchers led by Mingchen Shao has developed LAT-Audio, a large audio language model (LALM) designed to overcome the temporal awareness limitations that plague existing models on long-form audio. Current LALMs perform well on short clips (under a minute), but their accuracy degrades significantly on longer inputs, especially for tasks requiring precise time-stamped understanding, such as pinpointing specific sounds or events within a 30-minute recording. To address this, the team created LAT-Chronicle, a 1.2k-hour dataset of real-world audio with temporal annotations, and LAT-Bench, the first human-verified benchmark supporting audio up to 30 minutes across three core tasks: Dense Audio Caption, Temporal Audio Grounding, and Targeted Audio Caption.

LAT-Audio's key innovation is a progressive global-to-local reasoning paradigm. It first constructs a global timeline that aligns audio segments with semantic context, then introduces Think-With-Audio Chain-of-Thought (TWA-CoT) to iteratively refine its understanding by incorporating local audio details through tool use. This approach allows the model to maintain temporal alignment over long durations, unlike previous methods that lose accuracy as audio length increases. Experimental results show LAT-Audio significantly outperforms existing LALMs on all three temporal awareness tasks, demonstrating robust performance even on 30-minute inputs. The team has released the dataset, benchmark, and model publicly to accelerate research in long-form audio understanding, a critical area for applications like meeting transcription, podcast analysis, and surveillance audio processing.
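The global-to-local loop described above can be sketched roughly as follows. This is a minimal, hypothetical illustration, not the paper's actual implementation: the function names (`build_global_timeline`, `refine_with_local_audio`), the fixed window size, and the stand-in `fake_listen` tool are all assumptions made for the sketch.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float  # seconds
    end: float
    caption: str

def build_global_timeline(audio_len_s: float, window_s: float = 60.0) -> list[Segment]:
    """Pass 1 (global): coarse timeline of fixed windows with placeholder captions."""
    segments, t = [], 0.0
    while t < audio_len_s:
        end = min(t + window_s, audio_len_s)
        segments.append(Segment(t, end, caption="<coarse summary>"))
        t = end
    return segments

def refine_with_local_audio(segments: list[Segment], listen_tool, max_rounds: int = 2) -> list[Segment]:
    """Pass 2 (local): TWA-CoT-style refinement — re-examine each window via a
    tool call and replace the coarse caption with a detailed local description."""
    for _ in range(max_rounds):
        for seg in segments:
            seg.caption = listen_tool(seg.start, seg.end)
    return segments

# Toy "tool" standing in for the model's actual audio-listening tool call.
def fake_listen(start: float, end: float) -> str:
    return f"detailed events from {start:.0f}s to {end:.0f}s"

timeline = build_global_timeline(audio_len_s=150.0)
timeline = refine_with_local_audio(timeline, fake_listen)
```

The design point the sketch captures is that temporal anchors are fixed globally first, so local refinement can never drift out of alignment no matter how long the audio is.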

Key Points
  • LAT-Audio uses a global-to-local reasoning paradigm with TWA-CoT for iterative temporal refinement
  • LAT-Chronicle dataset includes 1,200 hours of real-world audio with temporal annotations
  • LAT-Bench is the first human-verified benchmark supporting audio up to 30 minutes across three tasks

Why It Matters

Enables precise time-stamped understanding of long audio, critical for meetings, podcasts, and surveillance.