Afford-VLA internalizes affordance for SOTA robot spatial reasoning
Learnable <AFF> tokens let robots see where to interact before acting.
Vision-language-action (VLA) models are promising for generalist robot manipulation, but they often struggle with spatial reasoning, specifically with figuring out exactly where to interact in complex scenes. Existing solutions rely on global geometric cues or externally generated visual signals that are only weakly tied to downstream actions. A team of researchers from multiple institutions (including Fudan, KAUST, and Alibaba) proposes a different approach: make planning local, visually grounded, internally generated, and directly aligned with action.
Their new framework, Afford-VLA, introduces learnable <AFF> tokens that query task-relevant interaction regions from multimodal features. These tokens decode affordance masks and convert them into compact embeddings that directly condition action generation. The key insight is that affordance is both generated and utilized within the VLA model, forming a tightly coupled perception-action pathway. The training strategy jointly optimizes the affordance pathway with action prediction, ensuring the visual planning strongly influences downstream control.
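To make the pathway described above concrete, here is a minimal NumPy sketch of the three steps: tokens query visual features, decode masks, and pool those masks into compact embeddings. It is an illustration under stated assumptions, not the authors' implementation. The class name `AffordanceSketch`, the cross-attention query, the dot-product mask decoder, the mask-weighted pooling, and the `aff_weight` loss weighting are all hypothetical choices made for this example; the paper's actual layers, sizes, and losses may differ.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class AffordanceSketch:
    """Hypothetical stand-in for an <AFF>-token affordance pathway."""

    def __init__(self, num_aff_tokens=4, dim=32, embed_dim=16, seed=0):
        rng = np.random.default_rng(seed)
        # In a real model these would be trainable parameters;
        # here they are fixed random arrays (assumption for illustration).
        self.aff_tokens = rng.normal(scale=0.1, size=(num_aff_tokens, dim))
        self.w_mask = rng.normal(scale=0.1, size=(dim, dim))
        self.w_embed = rng.normal(scale=0.1, size=(dim, embed_dim))

    def forward(self, features, grid_hw):
        # features: (num_patches, dim) multimodal patch features.
        h, w = grid_hw
        n, d = features.shape
        assert n == h * w, "patch count must match the spatial grid"

        # 1. <AFF> tokens query the features (single-head cross-attention).
        attn = softmax(self.aff_tokens @ features.T / np.sqrt(d), axis=-1)
        queried = attn @ features                      # (tokens, dim)

        # 2. Decode one soft affordance mask per token over the patch grid.
        mask_logits = (queried @ self.w_mask) @ features.T
        masks = sigmoid(mask_logits).reshape(-1, h, w)  # (tokens, h, w)

        # 3. Mask-weighted pooling gives a compact embedding per token,
        #    which would then condition the action head.
        weights = masks.reshape(-1, n)
        weights = weights / (weights.sum(axis=-1, keepdims=True) + 1e-8)
        aff_embed = (weights @ features) @ self.w_embed  # (tokens, embed_dim)
        return masks, aff_embed


def joint_loss(action_loss, mask_loss, aff_weight=0.5):
    # Joint objective: action loss plus a weighted affordance loss.
    # The additive form and aff_weight value are assumptions.
    return action_loss + aff_weight * mask_loss


features = np.random.default_rng(1).normal(size=(64, 32))  # 8x8 patch grid
masks, aff_embed = AffordanceSketch().forward(features, (8, 8))
print(masks.shape, aff_embed.shape)  # (4, 8, 8) (4, 16)
```

The point of the sketch is the data flow: the same tokens that produce the masks also produce the embeddings passed to the action head, so a gradient from the action loss would reach the affordance pathway when the two are trained jointly.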
Evaluated on the LIBERO, LIBERO-Plus, and SimplerEnv simulation benchmarks, Afford-VLA achieves consistent state-of-the-art performance. The method also demonstrates strong real-world results, suggesting that internalizing affordance as an action-aligned visual planning interface is a promising way to improve VLA systems. The work targets a fundamental limitation in robotic manipulation: the gap between visual understanding and precise action execution.
- Introduces learnable <AFF> tokens to query task-relevant interaction regions, eliminating reliance on external visual cues.
- Decodes affordance masks into compact embeddings that directly condition action generation, creating a tightly coupled perception-action pathway.
- Achieves state-of-the-art on LIBERO, LIBERO-Plus, and SimplerEnv benchmarks, with strong real-world transfer results.
Why It Matters
Afford-VLA makes robot manipulation more spatially aware and action-aligned, a step toward reliable generalist robots in unstructured environments.