VidEoMT: Your ViT is Secretly Also a Video Segmentation Model
New encoder-only architecture eliminates complex tracking modules while running 5-10x faster than existing methods.
A research team from TU Eindhoven and RWTH Aachen University has unveiled VidEoMT (Video Encoder-only Mask Transformer), a breakthrough in video segmentation that dramatically simplifies architecture while boosting speed. The model eliminates the complex tracking modules typically required for video segmentation, instead using a plain Vision Transformer (ViT) encoder with a novel query propagation mechanism.
VidEoMT's key innovation is its lightweight approach to temporal modeling. Rather than adding specialized tracking components, the model propagates the queries produced for the previous frame into the current one, then fuses them with temporally agnostic learned queries so it can adapt to newly appearing content. This query fusion strategy preserves tracking capability without adding architectural complexity. The result is a large efficiency gain: VidEoMT reaches up to 160 frames per second with a ViT-L backbone, a 5-10x speedup over existing methods, while maintaining competitive segmentation accuracy.
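To make the query propagation idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The function names (`fuse_queries`, `toy_encoder`, `run_video`) are hypothetical, fusion by element-wise addition is an assumption (the article does not say how the two query sets are combined), and a single softmax-attention step stands in for the full ViT encoder.

```python
import math

def fuse_queries(learned, propagated=None):
    """Combine learned queries with queries carried over from the previous frame.

    Assumption: fusion is element-wise addition. On the first frame there is
    nothing to propagate, so the learned queries are used unchanged.
    """
    if propagated is None:
        return [list(q) for q in learned]
    return [[a + b for a, b in zip(lq, pq)] for lq, pq in zip(learned, propagated)]

def toy_encoder(tokens, queries):
    """Hypothetical stand-in for the ViT encoder.

    Each query attends over the frame's patch tokens (softmax attention)
    and is updated with a residual connection.
    """
    dim = len(queries[0])
    updated = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, t)) / math.sqrt(dim) for t in tokens]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        attended = [
            sum(w * t[d] for w, t in zip(weights, tokens)) / total
            for d in range(dim)
        ]
        updated.append([a + b for a, b in zip(q, attended)])
    return updated

def run_video(frames, learned):
    """Process frames in order, carrying each frame's output queries forward."""
    propagated = None
    per_frame = []
    for tokens in frames:
        queries_in = fuse_queries(learned, propagated)
        queries_out = toy_encoder(tokens, queries_in)
        per_frame.append(queries_out)
        propagated = queries_out  # query i keeps referring to the same object slot
    return per_frame
```

In this sketch, object identity over time comes only from the query index: query i in one frame is seeded by query i from the frame before, so no separate tracker or matching step is involved. A real model would decode masks and class labels from the output queries, which is omitted here.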
The research builds on recent findings that sufficiently scaled ViT encoders can perform accurate image segmentation without specialized modules. By extending this principle to video, the team demonstrates that complex tracking architectures may be unnecessary when foundation models are properly leveraged. This approach could significantly lower the computational barrier for real-time video analysis applications, from autonomous vehicles to video editing tools. The code is publicly available, potentially accelerating adoption across the computer vision community.
- Eliminates specialized tracking modules through query propagation, reducing architectural complexity
- Achieves 5-10x faster inference, reaching up to 160 FPS with a ViT-L backbone
- Maintains competitive segmentation accuracy with an encoder-only Vision Transformer architecture
Why It Matters
VidEoMT could enable real-time video analysis for applications such as autonomous driving and content creation at substantially lower computational cost.