Can We Build Scene Graphs, Not Classify Them? FlowSG: Progressive Image-Conditioned Scene Graph Generation with Flow Matching
Researchers propose a flow-matching model that generates scene graphs progressively, not as a one-shot classification task.
A research team led by Xin Hu has introduced FlowSG, a paradigm-shifting approach to Scene Graph Generation (SGG). Traditional SGG models treat the task as a one-shot classification problem, predicting object boxes and their relationships in a single deterministic pass. FlowSG, in contrast, recasts it as a continuous-time generative process. Starting from a noisy initial graph, the model progressively refines both the geometry (bounding boxes) and semantics (object and predicate labels) through a series of constraint-aware steps, guided by a learned conditional velocity field.
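To make the "progressive refinement" idea concrete, here is a minimal sketch of few-step flow-matching inference on the geometric side: bounding boxes start as noise and are refined by Euler-integrating a velocity field. The paper's velocity field is a learned conditional network; the `velocity_fn` below is a hypothetical stand-in using the analytic velocity of a linear probability path, purely for illustration.

```python
import numpy as np

def euler_flow_refine(x0, velocity_fn, num_steps=8):
    """Few-step Euler integration of a velocity field over t in [0, 1).

    x0: noisy initial boxes, shape (N, 4).
    velocity_fn(x, t): returns dx/dt; in FlowSG this would be the learned
    conditional velocity network (stand-in here).
    """
    x = x0.astype(float).copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)  # one refinement step
    return x

# Toy "learned" field: for the linear path x_t = (1-t)*x0 + t*x1, the
# velocity that transports any state x to the target x1 is (x1 - x)/(1 - t).
target_boxes = np.array([[0.1, 0.2, 0.5, 0.6],
                         [0.3, 0.3, 0.9, 0.8]])
v = lambda x, t: (target_boxes - x) / (1.0 - t)

rng = np.random.default_rng(0)
noisy = rng.normal(size=target_boxes.shape)   # noisy initial graph geometry
refined = euler_flow_refine(noisy, v, num_steps=8)
```

With this analytic field the eight Euler steps land exactly on the target boxes; a learned field would only approximate this, which is why the method benefits from (but still only needs) a few steps.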
Technically, FlowSG uses a VQ-VAE to compress scene graphs into discrete tokens. A graph Transformer then plays two roles: predicting a velocity field for continuous geometric updates and updating discrete posteriors for categorical tokens, all while coupling semantics and geometry through flow-conditioned message aggregation. This mixed discrete-continuous formulation is trained with flow-matching losses, enabling efficient few-step inference. Crucially, it remains plug-and-play compatible with existing object detectors and segmenters, making it practical to integrate into current vision pipelines.
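The mixed discrete-continuous step can be sketched as follows: one update that moves boxes along a predicted velocity while nudging categorical label posteriors toward the network's predicted distribution. The `model` interface and the convex-combination posterior update below are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerically stable softmax
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mixed_flow_step(boxes, label_probs, model, t, dt):
    """One refinement step coupling continuous geometry and discrete semantics.

    `model(boxes, label_probs, t)` stands in for FlowSG's graph Transformer
    (hypothetical interface): it returns a box velocity and per-node label
    logits conditioned on the current graph state and time t.
    """
    velocity, logits = model(boxes, label_probs, t)
    boxes = boxes + dt * velocity                        # continuous Euler update
    target = softmax(logits)                             # predicted posterior
    label_probs = (1 - dt) * label_probs + dt * target   # discrete posterior update
    label_probs /= label_probs.sum(axis=-1, keepdims=True)
    return boxes, label_probs

# Toy usage: 3 objects, 5 candidate classes, 4 refinement steps.
rng = np.random.default_rng(1)
boxes = rng.normal(size=(3, 4))
probs = np.full((3, 5), 1 / 5)  # start from a uniform label posterior
toy_model = lambda b, p, t: (-b, rng.normal(size=p.shape))  # dummy network
for k in range(4):
    boxes, probs = mixed_flow_step(boxes, probs, toy_model, t=k / 4, dt=1 / 4)
```

Because each discrete update is a convex combination of valid distributions, the label posteriors stay proper probability vectors throughout the refinement trajectory, which is what lets semantics and geometry co-evolve across steps.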
The results are compelling. Extensive experiments on standard benchmarks like Visual Genome (VG) and PSG, under both closed- and open-vocabulary settings, show consistent gains. FlowSG improves predicate recall (R/mR) and graph-level metrics, delivering an average improvement of about 3 points over the previous state-of-the-art model, USG-Par. This validates the core hypothesis: building scene graphs progressively through a generative flow is more effective than classifying them in one shot, leading to more accurate and coherent representations of complex visual scenes.
- Reformulates SGG as a progressive generative task using flow matching, moving beyond one-shot classification.
- Couples continuous geometry (boxes) and discrete semantics (labels) via a graph Transformer and flow-conditioned aggregation.
- Achieves a ~3-point average improvement over the prior SOTA (USG-Par) on VG and PSG benchmarks in predicate and graph-level metrics.
Why It Matters
Enables AI systems to build more accurate, structured understandings of visual scenes, improving applications in robotics, image search, and autonomous systems.