Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context
Researchers combine audio with location data like POIs to reduce models' confusion between acoustically similar environmental sounds.
A team of researchers including Yuanbo Hou, Yanru Wu, and Stephen Roberts has published a new paper introducing Geo-ATBench, a benchmark designed to advance environmental sound recognition by fusing audio with geospatial data. The core problem they address is that traditional audio-only models often struggle to distinguish between acoustically similar sounds—like differentiating a car horn in a residential area versus a factory alarm. Geo-ATBench provides a systematic way to inject location-based context, pairing 10.71 hours of polyphonic audio across 28 sound categories with geospatial semantic context (GSC) derived from 11 categories of geographic information system data, such as points of interest (POIs).
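To make the audio-plus-GSC pairing concrete, the sketch below shows what a single benchmark example could look like. The field names, label strings, and POI-derived features are hypothetical placeholders for illustration, not the released dataset schema.

```python
# Hypothetical shape of one Geo-ATBench example; all field names and values
# below are illustrative assumptions, not the actual dataset format.
example = {
    "audio_path": "clips/0001.wav",        # a polyphonic environmental recording
    "labels": ["car_horn", "speech"],      # one or more of the 28 sound event categories
    "gsc": {
        # Geospatial semantic context: GIS information aggregated around the
        # recording location, e.g. POI counts or densities per category.
        "transport_poi_density": 0.8,
        "industrial_poi_density": 0.1,
        # ... one entry for each of the 11 GIS categories
    },
}
```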
The researchers also propose GeoFusion-AT, a unified framework that tests three ways of combining audio and location data: feature-level, representation-level, and decision-level fusion. Their results show that models incorporating GSC outperform audio-only baselines, with the largest gains on labels that are hardest to separate from the audio signal alone. A key validation comes from a crowdsourced listening study with 10 participants on 579 samples, which found no significant difference between the models' predictions and aggregated human labels, confirming that the benchmark's annotations align with human perception.
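As a rough illustration of how those three fusion points differ, the minimal PyTorch sketch below wires a clip-level audio embedding and a GSC vector into each strategy. The module structure, dimensions, and the simple logit averaging are assumptions made for illustration, not the GeoFusion-AT implementation.

```python
# Minimal sketch of feature-, representation-, and decision-level fusion.
# Dimensions and layer choices are illustrative assumptions.
import torch
import torch.nn as nn

NUM_CLASSES = 28   # sound event categories in Geo-ATBench
AUDIO_DIM = 128    # assumed clip-level audio embedding size
GSC_DIM = 11       # assumed one feature per GIS category


class FeatureLevelFusion(nn.Module):
    """Concatenate audio and GSC features before a shared encoder."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(AUDIO_DIM + GSC_DIM, 256), nn.ReLU(),
            nn.Linear(256, NUM_CLASSES),
        )

    def forward(self, audio_feat, gsc_feat):
        return self.encoder(torch.cat([audio_feat, gsc_feat], dim=-1))


class RepresentationLevelFusion(nn.Module):
    """Encode each modality separately, then merge the learned representations."""
    def __init__(self):
        super().__init__()
        self.audio_enc = nn.Sequential(nn.Linear(AUDIO_DIM, 128), nn.ReLU())
        self.gsc_enc = nn.Sequential(nn.Linear(GSC_DIM, 32), nn.ReLU())
        self.head = nn.Linear(128 + 32, NUM_CLASSES)

    def forward(self, audio_feat, gsc_feat):
        z = torch.cat([self.audio_enc(audio_feat), self.gsc_enc(gsc_feat)], dim=-1)
        return self.head(z)


class DecisionLevelFusion(nn.Module):
    """Run per-modality classifiers and combine their per-class scores."""
    def __init__(self):
        super().__init__()
        self.audio_head = nn.Linear(AUDIO_DIM, NUM_CLASSES)
        self.gsc_head = nn.Linear(GSC_DIM, NUM_CLASSES)

    def forward(self, audio_feat, gsc_feat):
        # Simple averaging of logits; a learned weighting is equally plausible.
        return 0.5 * (self.audio_head(audio_feat) + self.gsc_head(gsc_feat))


if __name__ == "__main__":
    audio = torch.randn(4, AUDIO_DIM)  # batch of 4 clip-level audio embeddings
    gsc = torch.randn(4, GSC_DIM)      # matching geospatial context vectors
    for model in (FeatureLevelFusion(), RepresentationLevelFusion(), DecisionLevelFusion()):
        probs = torch.sigmoid(model(audio, gsc))  # multi-label (polyphonic) tagging
        print(type(model).__name__, probs.shape)
```

The three classes differ only in where the location signal enters the pipeline, which is the design axis the benchmark compares.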
This work establishes a foundational task—geospatial audio tagging (Geo-AT)—within the computational auditory scene analysis (CASA) community. By making the dataset, code, and models publicly available, the team provides a reproducible standard for developing AI that understands sound in its real-world context. This moves beyond treating audio as an isolated signal, enabling applications in smart cities, environmental monitoring, and assistive technologies where location provides crucial disambiguating cues.
- The Geo-ATBench dataset contains 10.71 hours of audio paired with geospatial context across 28 sound event categories.
- The GeoFusion-AT framework shows location data improves tagging accuracy, especially for acoustically confounded sounds.
- A human study with 10 participants on 579 samples validated the benchmark's labels as human-aligned.
Why It Matters
Enables smarter environmental AI for applications like urban noise monitoring, wildlife tracking, and context-aware devices by moving beyond audio-only analysis.