Synthetic Visual Genome 2: Extracting Large-scale Spatio-Temporal Scene Graphs from Videos
A new 636K-video dataset and AI model dramatically improve how machines understand complex visual scenes over time.
A research team led by Ziqi Gao from Stanford University has unveiled Synthetic Visual Genome 2 (SVG2), a monumental leap in resources for teaching AI to understand videos. SVG2 is a panoptic video scene graph dataset containing over 636,000 videos annotated with 6.6 million objects, 52 million attributes, and 6.7 million spatio-temporal relations, representing an order-of-magnitude increase in scale and diversity over previous datasets. The dataset was created using a fully automated pipeline that combines advanced computer vision techniques like panoptic segmentation and trajectory tracking with GPT-5-based relation inference. This resource directly addresses a critical bottleneck in video AI: the lack of large-scale, high-quality data that captures how objects, their attributes, and their relationships evolve over time.
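To make the kind of annotation concrete, here is a minimal sketch of how a spatio-temporal scene graph could be represented in code: objects tracked as trajectories, attributes attached to each object, and relations that hold over a frame interval. The class and field names below are illustrative assumptions, not the actual SVG2 schema.

```python
# Illustrative spatio-temporal scene graph structures (not the SVG2 schema).
from dataclasses import dataclass, field


@dataclass
class ObjectTrajectory:
    """One object tracked through the video as a sequence of panoptic masks."""
    object_id: int
    category: str                       # e.g. "person", "cup"
    attributes: list[str]               # e.g. ["white", "moving"]
    masks: dict[int, str] = field(default_factory=dict)  # frame index -> encoded mask


@dataclass
class SpatioTemporalRelation:
    """A relation between two trajectories that holds over a frame interval."""
    subject_id: int
    predicate: str                      # e.g. "holding", "walking toward"
    object_id: int
    start_frame: int
    end_frame: int


@dataclass
class VideoSceneGraph:
    video_id: str
    objects: list[ObjectTrajectory]
    relations: list[SpatioTemporalRelation]


# Toy example: a person holds a cup from frame 0 to frame 45.
graph = VideoSceneGraph(
    video_id="example_000001",
    objects=[
        ObjectTrajectory(0, "person", ["standing"]),
        ObjectTrajectory(1, "cup", ["white"]),
    ],
    relations=[SpatioTemporalRelation(0, "holding", 1, 0, 45)],
)
```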
Building on SVG2, the team developed TRaSER (Trajectory-aligned Scene Graph Generation), a novel model designed to convert raw video into structured scene graphs in a single forward pass. TRaSER's key innovation is its dual-resampler architecture: a Temporal-Window Resampler that binds visual tokens to short trajectory segments to capture local motion, and an Object-Trajectory Resampler that aggregates entire object lifecycles to maintain global context. This design led to state-of-the-art performance, boosting relation detection by 15-20% and object prediction by 30-40% over the strongest open-source models, while also surpassing GPT-5 by 13%. Crucially, when these generated scene graphs are fed into a Vision-Language Model for video question-answering, they yield a 1.5-4.6% absolute accuracy gain, demonstrating the tangible utility of explicit spatio-temporal reasoning as an intermediate representation for complex AI tasks.
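The dual-resampler idea can be sketched in a few lines of PyTorch: learned queries cross-attend to the visual tokens inside short trajectory windows to capture local motion, while a second resampler pools the whole trajectory for lifecycle-level context. The module names, dimensions, window size, and query counts below are assumptions for illustration, not the paper's exact design.

```python
# Minimal sketch of a dual-resampler over one object trajectory (assumed design).
import torch
import torch.nn as nn


class Resampler(nn.Module):
    """Compress a variable-length token sequence into a fixed set of learned queries."""
    def __init__(self, dim: int, num_queries: int, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, dim) -> (batch, num_queries, dim)
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        out, _ = self.attn(q, tokens, tokens)
        return out


class DualResampler(nn.Module):
    def __init__(self, dim: int = 256, window: int = 8):
        super().__init__()
        self.window = window
        self.temporal_window = Resampler(dim, num_queries=4)      # local motion
        self.object_trajectory = Resampler(dim, num_queries=16)   # full lifecycle

    def forward(self, traj_tokens: torch.Tensor) -> torch.Tensor:
        # traj_tokens: (batch, frames, dim) visual tokens aligned to one trajectory
        b, t, d = traj_tokens.shape
        # 1) Resample each short temporal window to capture local motion.
        local = [self.temporal_window(traj_tokens[:, s:s + self.window])
                 for s in range(0, t, self.window)]
        local = torch.cat(local, dim=1)                    # (batch, n_windows * 4, dim)
        # 2) Resample the whole trajectory for global, lifecycle-level context.
        global_ctx = self.object_trajectory(traj_tokens)   # (batch, 16, dim)
        # A language-model head would consume the concatenated tokens.
        return torch.cat([local, global_ctx], dim=1)


# Toy usage: a batch of 2 trajectories, each spanning 32 frames.
tokens = torch.randn(2, 32, 256)
print(DualResampler()(tokens).shape)  # (2, 32, 256): 4 windows * 4 queries + 16 global
```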
- SVG2 dataset contains 636K videos with 6.6M objects and 6.7M relations, created via an automated pipeline using GPT-5.
- TRaSER model improves object prediction by 30-40% over baselines and beats GPT-5 by 13% using novel trajectory-aligned token mechanisms.
- Using TRaSER's scene graphs for video QA gives a 1.5-4.6% accuracy boost, demonstrating the value of structured video understanding (see the sketch after this list).
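One simple way such a gain can be realized is to serialize the predicted scene graph into text and supply it alongside the video and the question when prompting a Vision-Language Model. The serialization format and prompt wording below are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch: turn predicted relations into text context for a video QA prompt (assumed format).
def scene_graph_to_text(relations: list[tuple[int, str, int, int, int]],
                        names: dict[int, str]) -> str:
    """Render (subject_id, predicate, object_id, start, end) tuples as readable facts."""
    return "\n".join(
        f"{names[subj]} {pred} {names[obj]} (frames {start}-{end})"
        for subj, pred, obj, start, end in relations
    )


names = {0: "person", 1: "cup", 2: "table"}
relations = [(0, "holding", 1, 0, 45), (0, "standing next to", 2, 0, 120)]

prompt = (
    "Scene graph extracted from the video:\n"
    + scene_graph_to_text(relations, names)
    + "\n\nQuestion: What is the person holding at the start of the video?"
)
print(prompt)  # This text, together with the video frames, would be passed to the VLM.
```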
Why It Matters
Together, SVG2 and TRaSER provide the foundational data and models needed for AI to reliably understand complex, dynamic real-world scenes in applications like robotics and autonomous systems.