HAMMER: Harnessing MLLM via Cross-Modal Integration for Intention-Driven 3D Affordance Grounding
New CVPR 2026 model infers 3D affordances from 2D interaction images via a novel contact-aware embedding, outperforming prior methods on standard benchmarks.
A research team led by Lei Yao has unveiled HAMMER, an AI framework accepted at CVPR 2026 that tackles the problem of 3D affordance grounding: teaching machines to understand *how* objects can be interacted with in three-dimensional space. Rather than generating explicit object descriptions or relying on pre-trained 2D image segmenters, as traditional methods do, HAMMER harnesses emerging Multimodal Large Language Models (MLLMs). The core idea is inspired by human learning: we observe interactions in images or videos (someone holding a cup by its handle, say) and generalize that 'affordance' knowledge to new, unseen objects. Accordingly, HAMMER extracts the 'interaction intention' depicted in a 2D image and condenses it into a 'contact-aware embedding'.
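To make the idea concrete, here is a minimal PyTorch sketch of condensing an MLLM's output tokens into a single contact-aware embedding via a learnable query. The class name `ContactAwareEmbedding`, the `contact_query` parameter, and all dimensions are illustrative assumptions, not HAMMER's published implementation.

```python
# Hypothetical sketch: pool an MLLM's token features into one
# "contact-aware embedding". Names and sizes are assumptions.
import torch
import torch.nn as nn

class ContactAwareEmbedding(nn.Module):
    """Condense MLLM token features describing a 2D interaction
    (e.g., 'a hand grasping a cup by its handle') into one vector."""
    def __init__(self, mllm_dim=4096, embed_dim=512, num_heads=8):
        super().__init__()
        self.proj = nn.Linear(mllm_dim, embed_dim)  # map MLLM space -> task space
        self.contact_query = nn.Parameter(torch.randn(1, 1, embed_dim))  # learnable query
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, mllm_tokens):                 # (B, T, mllm_dim)
        tokens = self.proj(mllm_tokens)             # (B, T, embed_dim)
        query = self.contact_query.expand(tokens.size(0), -1, -1)
        # Attend over the MLLM tokens; the query learns to pick out
        # interaction/contact cues from the multimodal features.
        emb, _ = self.attn(query, tokens, tokens)   # (B, 1, embed_dim)
        return emb.squeeze(1)                       # (B, embed_dim)

# Toy example: 2 images, 256 MLLM tokens each
emb = ContactAwareEmbedding()(torch.randn(2, 256, 4096))
print(emb.shape)  # torch.Size([2, 512])
```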
Technically, HAMMER introduces a hierarchical cross-modal integration mechanism that exploits complementary information from the MLLM to progressively refine the object's 3D representation, and a multi-granular geometry lifting module that infuses spatial characteristics of the 3D object into the extracted intention embedding. Together, these components enable accurate 3D localization of affordance regions, pinpointing exactly *where* on a 3D model an action like 'grasp' or 'sit' applies. Extensive experiments, including on a newly constructed corrupted benchmark that tests robustness, demonstrate HAMMER's superiority over existing approaches. The code and model weights are publicly released, accelerating research in robotics, augmented reality, and human-computer interaction by offering a foundational model for understanding object utility.
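The two modules can likewise be sketched speculatively: point features cross-attend to the intention embedding at each level of a hierarchy, while a geometry-lifting step injects coarse and fine spatial cues into that embedding before per-point affordance scoring. Every name and design choice below (`CrossModalBlock`, `GeometryLifting`, residual fusion, max-pooled xyz features) is a hypothetical reading of the paper's description, not its actual architecture.

```python
# Speculative sketch of the two modules named above, in plain PyTorch.
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """One level of hierarchical integration: point features attend
    to the intention embedding, refining the 3D representation."""
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, point_feats, intent):         # (B, N, D), (B, D)
        ctx = intent.unsqueeze(1)                   # (B, 1, D)
        fused, _ = self.attn(point_feats, ctx, ctx)
        return self.norm(point_feats + fused)       # residual refinement

class GeometryLifting(nn.Module):
    """Infuse multi-granular spatial cues (here: raw xyz at two
    granularities) into the intention embedding."""
    def __init__(self, dim=512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, intent, xyz):                 # (B, D), (B, N, 3)
        coarse = self.mlp(xyz.mean(dim=1))          # global centroid cue
        fine = self.mlp(xyz).max(dim=1).values      # max-pooled per-point cue
        return intent + coarse + fine               # geometry-aware intention

# Toy forward pass: lift geometry into the intention, fuse, score points.
B, N, D = 2, 1024, 512
xyz, point_feats = torch.randn(B, N, 3), torch.randn(B, N, D)
intent = GeometryLifting()(torch.randn(B, D), xyz)
refined = CrossModalBlock()(point_feats, intent)
scores = torch.sigmoid(nn.Linear(D, 1)(refined)).squeeze(-1)  # (B, N) heatmap
print(scores.shape)  # torch.Size([2, 1024])
```

The per-point sigmoid scores stand in for the affordance heatmap; in a real system they would be supervised against annotated affordance regions.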
- Leverages MLLMs to extract 'interaction intention' from 2D images, creating a novel contact-aware embedding for 3D inference.
- Introduces a hierarchical cross-modal integration mechanism and geometry lifting module for precise 3D affordance localization.
- Demonstrated state-of-the-art performance and robustness on public datasets and a new corrupted benchmark; code is open-source.
Why It Matters
Enables more intuitive robots and AR systems that understand object use from simple observations, bridging 2D vision and 3D action.