IntenBot: Flexible and Imprecise Multimodal Input for LLMs to Understand User Intentions for Casual and Human-Like HRI
Say 'I want that' with a glance, and the robot understands without precise commands.
IntenBot, developed by Yen-Ting Liu and seven other researchers at National Taiwan University, reimagines human-robot interaction by letting users communicate as naturally as they would with another person. Instead of requiring precise voice commands or rigid gesture patterns, the system accepts rough multimodal input (spoken words, eye gaze, and finger pointing) and relies on an LLM to disambiguate intention. For instance, saying 'I want that' while glancing vaguely at a bottle is enough: the LLM filters out irrelevant modalities and imprecise data to generate a plausible instruction, which the user then confirms.
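To make that pipeline concrete, here is a minimal sketch, not IntenBot's actual code, of how noisy multimodal signals could be packed into a single prompt so the LLM can decide which modalities to trust. All names (`MultimodalFrame`, `build_intent_prompt`) and the prompt wording are hypothetical.

```python
# Hypothetical sketch: serializing imprecise multimodal input into one LLM
# prompt for intent disambiguation. Not the paper's implementation.
from dataclasses import dataclass, field

@dataclass
class MultimodalFrame:
    utterance: str                                             # rough speech transcript
    gaze_candidates: list[str] = field(default_factory=list)   # objects near the gaze ray
    point_candidates: list[str] = field(default_factory=list)  # objects near the pointing ray

def build_intent_prompt(frame: MultimodalFrame, scene_objects: list[str]) -> str:
    """Pack noisy, possibly conflicting modalities into a single prompt and
    let the LLM decide which signals are relevant and which to ignore."""
    return (
        "You interpret casual, imprecise human commands for a robot.\n"
        f"Objects in the scene: {', '.join(scene_objects)}\n"
        f"User said: \"{frame.utterance}\"\n"
        f"Gaze was roughly toward: {', '.join(frame.gaze_candidates) or 'nothing in particular'}\n"
        f"Pointing was roughly toward: {', '.join(frame.point_candidates) or 'no pointing detected'}\n"
        "Ignore modalities that look accidental or irrelevant, then reply with "
        "one concrete robot instruction, e.g. 'pick up the blue bottle'."
    )

frame = MultimodalFrame("I want that", gaze_candidates=["blue bottle", "mug"])
print(build_intent_prompt(frame, ["blue bottle", "mug", "remote control"]))
```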
The team first ran a user behavior study in a simulated XR environment to understand natural interaction patterns and calibrate angle ranges for gaze and pointing. They then evaluated IntenBot against baseline methods in an XR study, showing it reduces time, effort, and attention demands. Finally, they deployed the system on a physical robot to demonstrate real-world viability. The paper is available on arXiv (2605.04585) and contributes to more human-like, casual HRI by leveraging LLMs' disambiguation capabilities.
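The calibrated angle ranges can be pictured as acceptance cones around the measured gaze or pointing direction: any object whose direction from the eye or fingertip falls inside the cone becomes a candidate referent. The sketch below is a minimal version of that geometric test; the 20-degree threshold is an illustrative assumption, not a value taken from the paper.

```python
# Hypothetical sketch of cone-based candidate filtering for gaze/pointing.
# max_angle_deg stands in for the calibrated ranges from the behavior study;
# 20 degrees is an assumption, not the paper's value.
import math

def angle_deg(u, v):
    """Angle in degrees between two 3D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

def candidates_in_cone(origin, direction, objects, max_angle_deg=20.0):
    """Return names of objects whose direction from `origin` lies within
    `max_angle_deg` of the measured gaze/pointing `direction`."""
    hits = []
    for name, pos in objects.items():
        to_obj = tuple(p - o for p, o in zip(pos, origin))
        if angle_deg(direction, to_obj) <= max_angle_deg:
            hits.append(name)
    return hits

scene = {"blue bottle": (1.0, 0.1, 0.0), "mug": (0.0, 1.0, 0.0)}
print(candidates_in_cone((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), scene))  # ['blue bottle']
```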
- IntenBot accepts imprecise gestures, gaze, and voice — no need for explicit commands like 'pick up the blue bottle.'
- Uses LLM disambiguation to filter irrelevant input modalities and generate candidate instructions for user confirmation (see the sketch after this list).
- Validated in XR and on a physical robot; reduces user attention and effort compared to traditional multimodal interfaces.
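The confirmation step mentioned above might look like the following hedged sketch, where the user and robot callbacks are placeholders rather than the paper's actual interface.

```python
# Hypothetical sketch of the confirm-before-acting loop: the LLM's candidate
# instruction is read back to the user, and the robot acts only on a yes.
def confirm_and_execute(candidate, ask_user, execute):
    """Read the candidate instruction back; act only if the user confirms."""
    if ask_user(f"Did you mean: '{candidate}'?"):
        execute(candidate)
        return True
    return False

# Example with stubbed-in user/robot callbacks (placeholders, not a real API):
confirm_and_execute(
    "pick up the blue bottle",
    ask_user=lambda q: input(q + " [y/n] ").strip().lower() == "y",
    execute=lambda instr: print(f"Robot executing: {instr}"),
)
```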
Why It Matters
Makes robot control as casual as talking to a friend — key for natural home and assistive robotics.