Research & Papers

GenMatter: Perceiving Physical Objects with Generative Matter Models

A new AI that sees objects like humans do, even from random dots...

Deep Dive

A team of researchers from MIT and Harvard, including Eric Li, Arijit Dasgupta, and Joshua B. Tenenbaum, has introduced GenMatter, a generative model designed to perceive physical objects by mimicking human visual perception. Where existing computer vision systems lack a unified approach to motion-based scene interpretation, GenMatter hierarchically groups low-level motion cues and high-level appearance features into particles (small Gaussians representing local matter), then groups those particles into clusters corresponding to physical entities that move coherently and independently of one another. This allows the model to operate on diverse inputs, including random dot kinematograms, stylized textures, and naturalistic RGB video, addressing cases where biological vision succeeds but traditional computer vision fails.
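
To make the particle-and-cluster representation concrete, here is a minimal sketch in Python. Everything in it (the Particle and Cluster classes, their fields, and the greedy assignment helper) is an illustrative assumption for exposition, not GenMatter's actual data structures or inference:

```python
# Illustrative sketch only: every name and field below is an assumption,
# not GenMatter's real API.
from dataclasses import dataclass
import numpy as np

@dataclass
class Particle:
    """A small Gaussian blob of local 'matter' with motion and appearance."""
    mean: np.ndarray        # center position, e.g. (x, y) or (x, y, z)
    cov: np.ndarray         # covariance giving the Gaussian's spatial extent
    velocity: np.ndarray    # low-level motion cue (per-frame displacement)
    appearance: np.ndarray  # high-level feature vector (color, texture, ...)

@dataclass
class Cluster:
    """A group of particles that move together: one candidate physical entity."""
    particle_ids: list[int]
    velocity: np.ndarray    # shared motion (translation-only in this sketch)

def assign_particles(particles: list[Particle], clusters: list[Cluster],
                     motion_noise: float = 1.0) -> list[int]:
    """Greedy hard assignment: give each particle to the cluster whose shared
    motion best explains the particle's observed velocity."""
    assignments = []
    for p in particles:
        log_liks = [-np.sum((p.velocity - c.velocity) ** 2) / (2 * motion_noise**2)
                    for c in clusters]
        assignments.append(int(np.argmax(log_liks)))
    return assignments
```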

GenMatter's hardware-accelerated inference algorithm, based on parallelized block Gibbs sampling, recovers stable particle motion and groupings across different settings. The model was validated across three domains: on 2D random dot kinematograms, it captures human object perception, including graded uncertainty under ambiguous conditions; on a Gestalt-inspired dataset of camouflaged rotating objects, it recovers correct 3D structure from motion, enabling accurate 2D object segmentation; and on naturalistic RGB videos, it tracks the moving 3D matter that constitutes deforming objects, enabling robust object-level scene understanding. Presented at CVPR 2026, this work establishes a general framework for motion-based perception grounded in principles of human vision, with potential applications in robotics, autonomous systems, and scene understanding.
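
The block Gibbs idea can be sketched as alternating between two conditional updates, each of which factorizes and therefore parallelizes well: resample every particle's cluster assignment given the current cluster motions, then resample each cluster's motion given its assigned particles. The toy below assumes purely translational motion, isotropic Gaussian noise, and a fixed number of clusters; it is a simplified stand-in, not the paper's actual sampler:

```python
# Toy block Gibbs sampler for motion grouping. Assumptions (not from the
# paper): translational cluster motion, isotropic Gaussian noise, fixed K.
import numpy as np

def gibbs_step(velocities, motions, rng, noise=0.5, prior_var=10.0):
    """One sweep: (1) resample all assignments given motions, then
    (2) resample all cluster motions given assignments."""
    n_clusters = motions.shape[0]
    # Block 1: each particle's conditional depends only on the cluster
    # motions, so the whole block factorizes and vectorizes; this is what
    # makes the sampler hardware-friendly.
    sq_dist = ((velocities[:, None, :] - motions[None, :, :]) ** 2).sum(-1)
    logp = -sq_dist / (2 * noise**2)
    probs = np.exp(logp - logp.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    assignments = np.array([rng.choice(n_clusters, p=row) for row in probs])
    # Block 2: conjugate Gaussian update for each cluster's shared velocity,
    # independent across clusters (also parallelizable).
    for k in range(n_clusters):
        members = velocities[assignments == k]
        post_var = 1.0 / (len(members) / noise**2 + 1.0 / prior_var)
        post_mean = post_var * members.sum(axis=0) / noise**2
        motions[k] = rng.normal(post_mean, np.sqrt(post_var))
    return assignments, motions

# Usage: 100 "dots" drawn from two drift velocities; the two groups
# separate within a few dozen sweeps.
rng = np.random.default_rng(0)
velocities = np.concatenate([rng.normal([1.0, 0.0], 0.3, (50, 2)),
                             rng.normal([-1.0, 0.5], 0.3, (50, 2))])
motions = rng.normal(0.0, 1.0, size=(2, 2))
for _ in range(50):
    assignments, motions = gibbs_step(velocities, motions, rng)
```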

Key Points
  • GenMatter groups motion and appearance features into Gaussian particles and then into clusters, mimicking how humans perceive objects.
  • It works across three input types (random dots, stylized textures, and RGB video) and outperforms existing computer vision systems under ambiguous conditions.
  • The model uses hardware-accelerated block Gibbs sampling for stable particle motion recovery and object segmentation.

Why It Matters

Brings AI closer to human-like vision, enabling robust object perception in autonomous systems and robotics.