Xray-Visual Models: Scaling Vision Models on Industry-Scale Data
The model is trained on 15 billion image-text pairs and 10 billion video-hashtag pairs from Facebook and Instagram, combined with a novel noise-suppression scheme.
A team of 26 researchers presents Xray-Visual, a unified vision model for image and video understanding. Trained on an unprecedented 25 billion curated image/video pairs from Meta's platforms, it uses a three-stage pipeline combining MAE pre-training, hashtag classification, and CLIP-style contrastive learning. Built on a Vision Transformer with EViT for efficiency, it achieves state-of-the-art results on ImageNet, Kinetics, and MSCOCO, and shows strong robustness to domain shifts.
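To make the third stage concrete, below is a minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss over a batch of paired image and text embeddings. This is an illustrative NumPy implementation of the general technique, not the paper's actual code; the temperature value, normalization, and function names here are assumptions.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Illustrative sketch of CLIP-style contrastive learning; the
    temperature and exact formulation are assumptions, not taken
    from the Xray-Visual paper.
    """
    # L2-normalize so the dot product is cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    logits = img @ txt.T / temperature       # (B, B) similarity matrix
    labels = np.arange(len(logits))          # matching pairs lie on the diagonal

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits, labels)
                  + cross_entropy(logits.T, labels))
```

Each image embedding is pulled toward its paired text embedding and pushed away from the other captions in the batch, which is what lets hashtag- and caption-supervised data scale without per-image labels.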
Why It Matters
Demonstrates the power of massive, real-world social media data for building robust, general-purpose vision models that perform well across diverse tasks.