Affordance2Action teaches robots to grasp functional parts based on tasks
Robots can now identify functional object parts for specific tasks in cluttered scenes
A team of researchers from multiple institutions has released Affordance2Action (A2A), a benchmark-centered learning framework that lets robots determine which parts of objects to interact with for a given task in a cluttered scene. Unlike traditional affordance datasets that focus on grasping or object-level categories, A2A tackles the harder problem of task-conditioned, scene-level part affordance grounding. The core of the work is A2A-Bench, a manipulation-oriented benchmark covering both single-region and multi-region instruction correspondences in everyday environments. Multi-region cases highlight ambiguity: the same object may afford different interactions across tasks, and a single task may correspond to multiple valid functional regions depending on scene layout.
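To make the single-region versus multi-region distinction concrete, here is a minimal sketch of how one benchmark sample might be represented. The class names, field names, and example values are hypothetical illustrations, not the actual A2A-Bench schema, and a bounding box stands in for a part mask.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical layout for illustration only; not the A2A-Bench file format.
BBox = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class FunctionalRegion:
    object_name: str  # object instance the part belongs to
    part_name: str    # functional part, e.g. a handle
    bbox: BBox        # stand-in for a part segmentation mask


@dataclass
class AffordanceSample:
    instruction: str  # task-level language instruction
    regions: List[FunctionalRegion] = field(default_factory=list)

    @property
    def is_multi_region(self) -> bool:
        # More than one valid functional region for the same instruction.
        return len(self.regions) > 1


# Made-up examples: one instruction with a single valid region...
single = AffordanceSample(
    "pour water from the kettle",
    [FunctionalRegion("kettle", "handle", (40, 60, 90, 140))],
)

# ...and one instruction that several regions in the scene could satisfy.
multi = AffordanceSample(
    "open a drawer",
    [
        FunctionalRegion("top_drawer", "handle", (100, 200, 160, 220)),
        FunctionalRegion("bottom_drawer", "handle", (100, 300, 160, 320)),
    ],
)

print(single.is_multi_region, multi.is_multi_region)
```

Keeping the regions as a list on each instruction lets the same structure cover both cases, so a grounding model can be scored against any of the valid regions.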
To build A2A-Bench at scale, the researchers created A2A-AffordGen, an agent-assisted annotation pipeline that combines language-model filtering, interactive part segmentation, instance-level mask-out refinement, and human verification. The pipeline generates high-quality affordance annotations paired with task-relevant instructions. In experiments, A2A exposed substantial gaps in generic segmentation, VLM-based grounding, and affordance distillation baselines. The framework improves task-level localization and provides useful spatial priors for downstream manipulation. The researchers demonstrate real-time affordance grounding and affordance-conditioned manipulation policies as key applications. All datasets and code are planned for public release.
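The four annotation stages can be sketched as a simple chain of steps. This is an illustrative outline only: every function name and the toy data are hypothetical, the callables stand in for the language model, the interactive segmenter, and the human reviewer, and treating "mask-out refinement" as clipping a part mask to its object instance mask is an assumption, not a description of the released pipeline.

```python
from typing import Callable, Dict, List, Set, Tuple

Pixel = Tuple[int, int]
Mask = Set[Pixel]          # a mask as a set of (row, col) pixels
Candidate = Dict[str, str]  # keys: "object", "part", "instruction"


def refine_with_instance_mask(part_mask: Mask, instance_mask: Mask) -> Mask:
    # Assumed meaning of mask-out refinement: drop pixels outside the object.
    return part_mask & instance_mask


def run_pipeline(
    candidates: List[Candidate],
    is_task_relevant: Callable[[Candidate], bool],    # stand-in for LLM filtering
    segment_part: Callable[[Candidate], Mask],        # stand-in for interactive segmentation
    instance_masks: Dict[str, Mask],                  # per-object instance masks
    human_approves: Callable[[Candidate, Mask], bool],  # stand-in for human verification
) -> List[dict]:
    accepted = []
    for cand in candidates:
        if not is_task_relevant(cand):
            continue  # stage 1: filter out irrelevant candidates
        part_mask = segment_part(cand)  # stage 2: propose a part mask
        refined = refine_with_instance_mask(part_mask, instance_masks[cand["object"]])  # stage 3
        if refined and human_approves(cand, refined):  # stage 4: final check
            accepted.append({**cand, "mask": refined})
    return accepted


# Toy run with made-up data.
candidates = [
    {"object": "mug", "part": "handle", "instruction": "pick up the mug"},
    {"object": "mug", "part": "logo", "instruction": "pick up the mug"},
]
result = run_pipeline(
    candidates,
    is_task_relevant=lambda c: c["part"] == "handle",
    segment_part=lambda c: {(0, 0), (0, 1), (5, 5)},
    instance_masks={"mug": {(0, 0), (0, 1), (1, 1)}},
    human_approves=lambda c, m: len(m) >= 2,
)
print(result)
```

In this toy run the "logo" candidate is filtered out in the first stage, and the stray pixel (5, 5) is removed from the handle mask because it falls outside the mug's instance mask.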
- Introduces A2A-Bench covering single-region and multi-region instruction correspondences in everyday scenes for task-conditioned affordance grounding
- A2A-AffordGen pipeline uses LLM filtering, interactive part segmentation, instance-level mask-out refinement, and human verification to generate annotations at scale
- Experiments expose gaps in generic segmentation, VLM-based grounding, and affordance distillation baselines, while A2A improves task-level localization and supplies spatial priors for downstream manipulation
Why It Matters
A2A helps bridge the gap between language instructions and precise robot actions in cluttered, real-world environments.