An Approach to Combining Video and Speech with Large Language Models in Human-Robot Interaction
A new framework combines Florence-2, Llama 3.1, and Whisper to let users command robots with natural language.
Researchers Guanting Shen and Zi Tian have introduced a multimodal framework that advances how robots interpret human intent. Published on arXiv, their work presents a system for human-robot interaction (HRI) that combines video and speech processing with large language models (LLMs). The core innovation is a tightly integrated architecture that uses Microsoft's Florence-2 for real-time object detection, Meta's Llama 3.1 for natural language understanding and task planning, and OpenAI's Whisper for speech recognition. This combination lets a user issue a spoken command such as "pick up the red block"; the system interprets the intent, locates the object in the visual scene, and plans the robotic action accordingly.
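The speech-to-action flow described above can be sketched as a simple three-stage pipeline. The stubs below are illustrative placeholders only: they do not call the real Whisper, Llama 3.1, or Florence-2 models, and the function names, the plan format, and the hard-coded detections are assumptions made for this sketch, not the paper's implementation.

```python
# Hypothetical sketch of the pipeline: speech -> language planning -> vision -> action.
# Each stub stands in for one stage the article describes.

def transcribe_speech(audio):
    """Stands in for Whisper: audio in, text out."""
    return "pick up the red block"

def plan_task(utterance):
    """Stands in for Llama 3.1: reduce the utterance to an (action, target) plan."""
    words = utterance.lower().split()
    action = "pick_up" if "pick" in words else "unknown"
    target = " ".join(words[-2:])  # e.g. "red block"
    return {"action": action, "target": target}

def detect_objects(frame):
    """Stands in for Florence-2: return labeled objects with bounding boxes."""
    return [
        {"label": "red block", "bbox": (120, 80, 40, 40)},
        {"label": "blue cup", "bbox": (200, 60, 50, 70)},
    ]

def execute_command(audio, frame):
    """Tie the stages together: match the planned target against detections."""
    plan = plan_task(transcribe_speech(audio))
    for obj in detect_objects(frame):
        if obj["label"] == plan["target"]:
            # In the real system this would drive the Dobot Magician arm.
            return {"action": plan["action"], "bbox": obj["bbox"]}
    return None  # target not found in the scene
```

Running `execute_command(None, None)` with these stubs yields a `pick_up` action grounded to the red block's bounding box, mirroring the spoken-command example in the article.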
The technical approach employs fuzzy logic to make command interpretation more reliable and more tolerant of ambiguous or incomplete instructions. Crucially, the entire pipeline runs on consumer-grade hardware, demonstrating a practical and accessible pathway to advanced HRI. In experimental evaluations, the system achieved a 75% command-execution accuracy while controlling a Dobot Magician robotic arm. That figure points to the viability of the current integration and to the framework's potential as an extensible foundation. The researchers position their work not as a standalone system but as a flexible blueprint for future development, enabling more natural collaboration in which robots dynamically understand and act on multimodal human communication.
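One way fuzzy logic can add reliability is by fusing graded confidence scores from the speech and vision stages instead of applying hard thresholds to each. The paper's actual rule base and membership functions are not detailed here; the triangular shapes, the min t-norm, and the 0.5 threshold below are assumptions chosen purely to illustrate the idea.

```python
# Illustrative-only fuzzy gating of a command, not the paper's actual rule base.

def triangular(x, a, b, c):
    """Triangular membership function rising from a, peaking at b, falling to c."""
    if x <= a or x >= c:
        return 0.0
    if x == b:
        return 1.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def command_confidence(asr_score, detection_score):
    """Fuzzy AND (min t-norm) of 'speech was clear' and 'object is certain'."""
    speech_clear = triangular(asr_score, 0.3, 1.0, 1.7)       # peaks at a perfect score
    object_certain = triangular(detection_score, 0.3, 1.0, 1.7)
    return min(speech_clear, object_certain)

def decide(asr_score, detection_score, threshold=0.5):
    """Act only when combined confidence is high; otherwise ask the user to clarify."""
    if command_confidence(asr_score, detection_score) >= threshold:
        return "execute"
    return "clarify"
```

With this scheme a confidently transcribed command paired with a weak detection (or vice versa) falls below the threshold, so the system can ask for clarification rather than act on an ambiguous instruction.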
- Integrates Florence-2 (vision), Llama 3.1 (language), and Whisper (speech) into a unified control system for a Dobot Magician arm.
- Achieved 75% command execution accuracy in tests, running entirely on accessible consumer-grade hardware.
- Uses fuzzy logic to improve reliability, creating an extensible foundation for future natural human-robot collaboration research.
Why It Matters
It demonstrates a practical, low-cost blueprint for creating robots that understand natural spoken commands, moving us closer to intuitive human-machine teamwork.