Research & Papers

StreamPro AI shifts video models from reactive to proactive reasoning, scoring roughly 4x the previous best

New benchmark forces AI to decide when to respond, not just what to say

Deep Dive

Current video understanding models follow a passive “see-then-answer” paradigm: they wait for clear evidence before responding, which reduces proactive reasoning to delayed perception. The new paper “StreamPro: From Reactive Perception to Proactive Decision-Making in Streaming Video” by Ao Li and nine co-authors tackles the harder problem of deciding when to speak, not just what to say. The authors introduce StreamPro-Bench, a benchmark that scores models on three axes: Perception Understanding, Temporal Reasoning, and Proactive Agency. The last axis measures a model’s ability to make early yet reliable decisions from incomplete streams, a critical skill for real-time applications such as live surveillance, autonomous driving, and interactive assistants.

To train such models, the team proposes a two-stage framework, also called StreamPro. First, they use CB-Stream Loss during supervised fine-tuning to mitigate the extreme imbalance between long periods of irrelevant silence and short, critical response windows. Second, they apply Group Relative Policy Optimization (GRPO) with a multi-grained reward design that penalizes both wrong answers and poor timing, so that correctness and decision delay are optimized jointly. The reported gains are large: StreamPro scores 41.5 on its proactive benchmark, roughly four times the previous best of 10.4, while maintaining strong real-time performance (78.9 on StreamingBench-RTVU). The work signals a shift from passive video understanding to agents that can proactively engage with streaming video.
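
The summary above does not say how CB-Stream Loss is defined, so the sketch below is not the paper's formulation. It only illustrates the imbalance problem with a standard technique, inverse-frequency class weighting applied to a binary "stay silent" versus "respond" decision at each time step. The function names, the toy stream, and the 0.05 prediction are all assumptions made for illustration.

```python
import math
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight class c by N / (K * n_c), so rare classes count more.

    Generic class-balancing heuristic, NOT the paper's CB-Stream Loss.
    """
    counts = Counter(labels)
    total, num_classes = len(labels), len(counts)
    return {c: total / (num_classes * n) for c, n in counts.items()}

def weighted_bce(probs, labels, weights):
    """Weighted binary cross-entropy.

    probs[i] is the predicted probability of responding at step i;
    labels[i] is 1 for a response step and 0 for a silent step.
    """
    eps = 1e-12
    loss, norm = 0.0, 0.0
    for p, y in zip(probs, labels):
        w = weights[y]
        p_true = p if y == 1 else 1.0 - p
        loss += -w * math.log(max(p_true, eps))
        norm += w
    return loss / norm

# Hypothetical toy stream: 18 silent steps (0) and 2 response steps (1).
labels = [0] * 18 + [1] * 2

# A degenerate model that almost always stays silent.
probs = [0.05] * len(labels)

uniform = {0: 1.0, 1: 1.0}
balanced = inverse_frequency_weights(labels)

unweighted_loss = weighted_bce(probs, labels, uniform)
balanced_loss = weighted_bce(probs, labels, balanced)

print("class weights:", balanced)
print("unweighted loss:", round(unweighted_loss, 4))
print("balanced loss:", round(balanced_loss, 4))
```

Under uniform weights, the always-silent model looks acceptable because silent steps dominate the average. With balanced weights, the two missed response steps carry as much total weight as the eighteen silent ones, so the same model is penalized far more heavily.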

Key Points
  • StreamPro-Bench introduces Proactive Agency as a new metric, measuring early decision-making under partial observations
  • CB-Stream Loss addresses severe supervision imbalance between silence and response signals during training
  • GRPO with multi-grained rewards (turn-level and trajectory-level) jointly optimizes response accuracy and timing, reaching 41.5 versus the prior best of 10.4 on the proactive benchmark
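
The group-relative step that gives GRPO its name can be sketched in a few lines: several responses are sampled for the same input, and each reward is normalized against the mean and standard deviation of its group. The reward function here is a hypothetical stand-in that mixes correctness with a linear delay penalty. It collapses the paper's turn-level and trajectory-level rewards into a single scalar, and its constants (the 0.5 split and the 10-second cap) are assumptions. Using the population standard deviation is also a choice made for this sketch.

```python
from statistics import mean, pstdev

def toy_reward(correct, delay_s, max_delay_s=10.0):
    """Hypothetical reward: 0 if wrong; otherwise 0.5 plus up to 0.5
    for answering quickly (bonus decays linearly with delay)."""
    if not correct:
        return 0.0
    timeliness = max(0.0, 1.0 - delay_s / max_delay_s)
    return 0.5 + 0.5 * timeliness

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled rollouts for the same clip and question:
# (answered correctly?, delay in seconds after the evidence appears).
rollouts = [(True, 1.0), (True, 6.0), (False, 0.5), (True, 9.0)]

rewards = [toy_reward(c, d) for c, d in rollouts]
advantages = group_relative_advantages(rewards)

for (correct, delay), r, a in zip(rollouts, rewards, advantages):
    print(f"correct={correct!s:5} delay={delay:4.1f}s reward={r:.2f} advantage={a:+.2f}")
```

In this toy group, the fast correct rollout receives the largest positive advantage, the wrong rollout the most negative, and the slow but correct rollout lands near zero, which is the sense in which accuracy and timing are rewarded together.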

Why It Matters

Proactive video AI is essential for real-time systems that must act before the full evidence appears.