Xray-Visual Models: Scaling Vision Models on Industry-Scale Data
The model is trained on 15 billion image-text pairs and 10 billion video-hashtag pairs from Facebook and Instagram, combined with a novel noise-suppression scheme.
A team of 26 researchers presents Xray-Visual, a unified vision model for image and video understanding. Trained on an unprecedented 25 billion curated image/video pairs from Meta's platforms, it uses a three-stage pipeline combining MAE pre-training, hashtag classification, and CLIP-style contrastive learning. Built on a Vision Transformer with EViT for efficiency, it achieves state-of-the-art results on ImageNet, Kinetics, and MSCOCO, and shows strong robustness to domain shifts.
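To make the third stage concrete, below is a minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss over a batch of paired image and text embeddings. This is an illustrative NumPy implementation of the general technique, not the paper's actual code; the temperature value, normalization, and function names here are assumptions.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Illustrative sketch of CLIP-style contrastive learning; the
    temperature and exact formulation are assumptions, not taken
    from the Xray-Visual paper.
    """
    # L2-normalize so the dot product is cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    logits = img @ txt.T / temperature       # (B, B) similarity matrix
    labels = np.arange(len(logits))          # matching pairs lie on the diagonal

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits, labels)
                  + cross_entropy(logits.T, labels))
```

Each image embedding is pulled toward its paired text embedding and pushed away from the other captions in the batch, which is what lets hashtag- and caption-supervised data scale without per-image labels.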
Why It Matters
Demonstrates the power of massive, real-world social media data for building robust, general-purpose vision models that perform well across diverse tasks.