Robotics

Afford-VLA internalizes affordance for SOTA robot spatial reasoning

arXiv cs.RO May 26, 2026

⚡ Learnable <AFF> tokens let robots see where to interact before acting.

Deep Dive

Vision-language-action (VLA) models are promising for generalist robot manipulation, but they often struggle with spatial reasoning, specifically with determining exactly where to interact in complex scenes. Existing solutions rely on global geometric cues or externally generated visual signals that are only weakly tied to downstream actions. A team of researchers from multiple institutions (including Fudan, KAUST, and Alibaba) proposes a different approach: make planning local, visually grounded, internally generated, and directly aligned with action.

Their new framework, Afford-VLA, introduces learnable <AFF> tokens that query task-relevant interaction regions from multimodal features. These tokens decode affordance masks and convert them into compact embeddings that directly condition action generation. The key insight is that affordance is both generated and utilized within the VLA model, forming a tightly coupled perception-action pathway. The training strategy jointly optimizes the affordance pathway with action prediction, ensuring the visual planning strongly influences downstream control.
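To make the pathway concrete, here is a minimal sketch in plain Python of the three steps described above: a learnable token scores image patches, the scores form a soft affordance mask, and the mask pools patch features into a compact embedding that conditions the action input. This is an illustration under stated assumptions, not the authors' implementation. Every function name, the sigmoid scoring, the mask-weighted pooling, the concatenation-based conditioning, and the weighted-sum loss with `aff_weight` are hypothetical choices made for the example; the article does not specify any of them.

```python
import math
import random


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affordance_query(aff_token, patch_features):
    """Score each patch against the <AFF> token; returns a soft mask in (0, 1).

    Assumption: scaled dot-product scoring followed by a sigmoid.
    """
    scale = math.sqrt(len(aff_token))
    return [sigmoid(dot(aff_token, p) / scale) for p in patch_features]


def pool_embedding(mask, patch_features):
    """Mask-weighted average of patch features: a compact affordance embedding."""
    total = sum(mask)
    dim = len(patch_features[0])
    return [
        sum(m * p[d] for m, p in zip(mask, patch_features)) / total
        for d in range(dim)
    ]


def condition_action(action_input, aff_embedding):
    """Hypothetical conditioning: append the embedding to the action head's input."""
    return action_input + aff_embedding


def joint_loss(pred_action, target_action, mask, target_mask, aff_weight=0.5):
    """Assumed joint objective: action MSE plus weighted binary cross-entropy on the mask."""
    action_loss = sum(
        (p - t) ** 2 for p, t in zip(pred_action, target_action)
    ) / len(pred_action)
    eps = 1e-7  # avoids log(0)
    mask_loss = -sum(
        t * math.log(m + eps) + (1 - t) * math.log(1 - m + eps)
        for m, t in zip(mask, target_mask)
    ) / len(mask)
    return action_loss + aff_weight * mask_loss


# Toy usage with random stand-ins for real multimodal features.
random.seed(0)
dim, num_patches = 4, 6
patches = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_patches)]
aff_token = [random.gauss(0, 1) for _ in range(dim)]  # would be a trained parameter
mask = affordance_query(aff_token, patches)
emb = pool_embedding(mask, patches)
cond = condition_action([0.0, 0.0, 0.0], emb)
```

Because the mask both receives its own loss term and feeds the action input, gradients from action prediction would also reach the token in a real differentiable implementation, which is one way to read the "tightly coupled" claim.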

Evaluated on three simulation benchmarks (LIBERO, LIBERO-Plus, and SimplerEnv), Afford-VLA achieves consistent state-of-the-art performance. The method also shows strong real-world results, suggesting that internalizing affordance as an action-aligned visual planning interface is an effective way to improve VLA systems. This work addresses a fundamental limitation in robotic manipulation: the gap between visual understanding and precise action execution.

Key Points

Introduces learnable <AFF> tokens to query task-relevant interaction regions, eliminating reliance on external visual cues.
Decodes affordance masks into compact embeddings that directly condition action generation, creating a coupled perception-action loop.
Achieves state-of-the-art results on the LIBERO, LIBERO-Plus, and SimplerEnv benchmarks, with strong real-world transfer results.

Why It Matters

Makes robot manipulation more spatially aware and action-aligned, moving toward reliable generalist robots in unstructured environments.
