Research & Papers

Proact-VL: A Proactive VideoLLM for Real-Time AI Companions

New VideoLLM framework tackles the latency and timing challenges of interactive AI that comments on live streams.

Deep Dive

A research team from Microsoft Research Asia and Tsinghua University has introduced Proact-VL, a novel framework designed to shape multimodal language models into proactive, real-time interactive agents. The work addresses the critical gap in creating human-like AI companions that can perceive and interact with dynamic environments, such as live video streams, without disruptive delays. The researchers instantiated this concept through two gaming scenarios—a live commentator and an interactive guide—selected for their suitability for automatic evaluation. To support this, they also introduced the Live Gaming Benchmark, a large-scale dataset with three representative scenarios: solo commentary, co-commentary, and user guidance, providing a standardized testbed for future real-time AI research.

The technical core of Proact-VL tackles three challenges for real-time interaction: achieving sub-second inference latency on continuous video input; deciding autonomously when to interject (proactive response timing); and controlling the length and quality of generated content to fit real-time constraints. According to the authors, extensive experiments show that Proact-VL achieves a superior balance between response speed and output quality while maintaining robust video understanding, moving beyond passive Q&A to active participation. The framework is a step toward practical, low-latency AI companions for applications such as live esports commentary, interactive tutoring, and real-time customer support in dynamic visual environments.
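
To make the timing and length-control ideas concrete, here is a minimal Python sketch of a proactive streaming loop. It is not the authors' implementation: the names GateConfig, run_companion, score_fn, and generate_fn, the threshold, the cooldown, and the word cap are all hypothetical stand-ins chosen for illustration. The sketch only shows the general pattern of scoring each incoming frame, speaking when the score is high enough and enough time has passed, and truncating the output to a fixed budget.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List, Tuple


@dataclass
class GateConfig:
    # All three values are illustrative assumptions, not numbers from the paper.
    speak_threshold: float = 0.6  # minimum score required to interject
    cooldown_s: float = 2.0       # minimum gap between two utterances, in seconds
    max_words: int = 12           # cap on utterance length


def run_companion(
    frames: Iterable[Tuple[float, Any]],
    score_fn: Callable[[Any], float],
    generate_fn: Callable[[Any], str],
    cfg: GateConfig,
) -> List[Tuple[float, str]]:
    """Walk a stream of (timestamp_s, frame) pairs and return (timestamp_s, text) utterances.

    score_fn stands in for whatever signal decides "speak now";
    generate_fn stands in for the model call that produces commentary.
    """
    utterances: List[Tuple[float, str]] = []
    last_spoke = float("-inf")
    for ts, frame in frames:
        # Proactive timing: stay silent unless the frame looks worth a comment.
        if score_fn(frame) < cfg.speak_threshold:
            continue
        # Avoid talking over the previous utterance.
        if ts - last_spoke < cfg.cooldown_s:
            continue
        # Length control: keep only the first max_words words.
        words = generate_fn(frame).split()[: cfg.max_words]
        utterances.append((ts, " ".join(words)))
        last_spoke = ts
    return utterances
```

In this toy version a frame can be any object; passing a plain float as the frame and the identity function as score_fn is enough to exercise the gating logic. A real system would replace both callables with model calls and would also need to bound the time spent inside generate_fn, which this sketch does not attempt.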

Key Points
  • Tackles the core latency challenge for AI watching live video, aiming for sub-second response times on streaming input.
  • Introduces a 'proactive' mechanism for the AI to decide when to speak, not just respond to direct queries.
  • Validated on the new 'Live Gaming Benchmark' dataset, which covers solo commentary, co-commentary, and user guidance scenarios, with a reported superior balance between response speed and output quality.

Why It Matters

Paves the way for AI that can actively participate in live events, from sports commentary to interactive training and support.