ViSAGE @ NTIRE 2026 Challenge on Video Saliency Prediction
The winning model combines adaptive gating with an ensemble of expert decoders, taking first place on two of the four official evaluation metrics.
A research team has won the prestigious NTIRE 2026 Challenge on Video Saliency Prediction with their AI framework ViSAGE (Video Saliency with Adaptive Gated Experts). Presented at CVPR 2026, the solution addresses the task of predicting where humans will look in video sequences—a critical capability for applications ranging from content compression to augmented reality. The approach's key innovation is a multi-expert ensemble architecture in which several specialized neural network decoders, each with different inductive biases (learning assumptions), run in parallel.
Each expert decoder in ViSAGE performs adaptive gating and feature modulation to refine the spatio-temporal information extracted from the video. This design lets different experts capture complementary aspects of saliency, such as motion cues, object recognition, or scene context. At inference time, the system fuses the predictions from all experts. The result is a model that aggregates these diverse perspectives to outperform most competing solutions, securing first place on two out of four official evaluation metrics on the challenge's Private Test set. The team has made their code publicly available, providing a new state-of-the-art baseline for the computer vision community.
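The article does not detail ViSAGE's exact fusion mechanism, but the general idea of gated expert fusion can be sketched in a few lines of NumPy. In this illustrative example, every name, shape, and the simple softmax gate are hypothetical, not taken from the ViSAGE code: each expert produces a saliency map, a gate assigns each expert a weight, and the fused prediction is the weighted sum.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax for the gating weights.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_experts(expert_maps, gate_logits):
    """Weighted fusion of per-expert saliency maps.

    expert_maps: (E, H, W) array, one predicted map per expert
    gate_logits: (E,) unnormalized gate scores (in a real model,
                 these would come from a learned gating network)
    Returns a single (H, W) fused saliency map.
    """
    weights = softmax(gate_logits)                      # (E,), sums to 1
    return np.tensordot(weights, expert_maps, axes=1)   # (H, W)

# Toy example: three "experts" (e.g. motion, objects, scene context)
rng = np.random.default_rng(0)
experts = rng.random((3, 4, 4))        # three 4x4 saliency maps in [0, 1)
logits = np.array([2.0, 0.5, -1.0])    # gate favoring the first expert
fused = fuse_experts(experts, logits)
```

Because the gate weights form a convex combination, the fused map stays within the range of the individual expert maps; a learned gate can shift these weights per video, so motion-focused experts dominate on dynamic scenes while context experts dominate on static ones.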
- ViSAGE won the NTIRE 2026 Video Saliency Prediction Challenge, ranking first on 2 out of 4 key metrics.
- The model uses a multi-expert ensemble framework where specialized decoders perform adaptive gating on spatio-temporal features.
- The team has released the code publicly, providing a new benchmark for predicting human gaze in videos.
Why It Matters
Advancements in video saliency prediction directly improve video compression, AR/VR experiences, and content creation tools.