Research & Papers

Keeping the Evidence Chain: Semantic Evidence Allocation for Training-Free Token Pruning in Video Temporal Grounding

New method slashes compute costs for video-language models by intelligently selecting only 12.5% of visual tokens.

Deep Dive

A research team led by Jiaqi Li has introduced SemVID, a novel training-free framework that dramatically accelerates Video Temporal Grounding (VTG) tasks. VTG involves pinpointing specific moments in long, untrimmed videos based on text queries, a process that typically requires expensive video-language models to process massive amounts of visual data. SemVID tackles this inefficiency by implementing intelligent token pruning, selectively keeping only 12.5% of visual tokens while preserving critical information. The breakthrough lies in its semantic allocation strategy, which is guided by two core principles identified for VTG: Evidence Retention (to keep query-critical patches, especially around event boundaries) and Connectivity Strength (to preserve token-level connections across frames for long-range reasoning).
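To make those two principles concrete, here is a minimal, hypothetical Python sketch of how per-token scores along the two axes could be computed. The function name score_tokens, the use of normalized CLIP-style patch embeddings, and the next-frame comparison are illustrative assumptions for exposition, not the authors' implementation.

    import numpy as np

    def score_tokens(frame_tokens, next_frame_tokens, query_emb):
        """frame_tokens, next_frame_tokens: (num_patches, dim); query_emb: (dim,)."""
        t = frame_tokens / np.linalg.norm(frame_tokens, axis=1, keepdims=True)
        n = next_frame_tokens / np.linalg.norm(next_frame_tokens, axis=1, keepdims=True)
        q = query_emb / np.linalg.norm(query_emb)

        # Evidence Retention: how strongly each patch matches the text query.
        evidence = t @ q                      # (num_patches,)

        # Connectivity Strength: how well each patch links to the same spatial
        # position in the next frame, so cross-frame reasoning stays possible.
        connectivity = np.sum(t * n, axis=1)  # (num_patches,)
        return evidence, connectivity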

SemVID operates by first allocating per-frame token budgets, balancing query relevance and inter-frame variation to prevent over-pruning crucial segments. It then strategically selects three complementary token types: object tokens for diverse evidence related to the query, motion tokens to capture meaningful transitions and act as cross-frame relays, and a small set of context tokens to maintain scene continuity. This approach ensures the "evidence chain" necessary for accurate temporal localization remains intact. Extensive testing on standard VTG benchmarks shows SemVID retains up to 95.4% of the mean Intersection over Union (mIoU) accuracy achieved by processing all tokens. More impressively, it delivers up to a 5.8x speedup in the model's prefill phase—the computationally heavy initial processing step—consistently outperforming prior token-pruning methods under identical efficiency constraints.
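The two-stage pipeline described above can be sketched in simplified form. In the sketch below, the function names allocate_budgets and select_tokens, the 50/50 blend of relevance and variation, and the 60/30/10 split between object, motion, and context tokens are all illustrative assumptions; they show the general shape of budget allocation followed by three-way token selection, not the paper's actual scoring or quotas.

    import numpy as np

    def allocate_budgets(frame_relevance, frame_variation, total_keep):
        """frame_relevance, frame_variation: (num_frames,) scores; returns per-frame quotas."""
        weight = 0.5 * frame_relevance + 0.5 * frame_variation   # illustrative 50/50 blend
        weight = weight / weight.sum()
        # At least one token per frame so no segment is pruned away entirely;
        # quotas are approximate and may not sum exactly to total_keep.
        return np.maximum(1, np.round(weight * total_keep)).astype(int)

    def select_tokens(evidence, connectivity, budget, ratios=(0.6, 0.3, 0.1)):
        """Pick indices of object, motion, and context tokens for a single frame."""
        n_obj = max(1, int(budget * ratios[0]))
        n_mot = max(1, int(budget * ratios[1]))
        n_ctx = max(0, budget - n_obj - n_mot)

        # Object tokens: most query-relevant patches (diverse evidence).
        obj = np.argsort(-evidence)[:n_obj]
        remaining = np.setdiff1d(np.arange(len(evidence)), obj)

        # Motion tokens: strongest cross-frame links among the rest (relays).
        mot = remaining[np.argsort(-connectivity[remaining])][:n_mot]
        rest = np.setdiff1d(remaining, mot)

        # Context tokens: a small, evenly spaced sample for scene continuity.
        n_ctx = min(n_ctx, len(rest))
        ctx = rest[np.linspace(0, len(rest) - 1, num=n_ctx, dtype=int)] if n_ctx else np.array([], dtype=int)
        return np.concatenate([obj, mot, ctx])

Under these assumptions, the kept tokens per frame are simply the union of the three groups, which is what keeps the "evidence chain" intact while discarding the bulk of the visual input.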

Key Points
  • Achieves up to a 5.8x prefill speedup by pruning 87.5% of visual tokens, keeping only 12.5%.
  • Retains up to 95.4% of the full-token mIoU on Video Temporal Grounding benchmarks without any model retraining.
  • Uses semantic allocation to select object, motion, and context tokens, preserving critical evidence chains.

Why It Matters

Makes querying long videos with video-language models (VLMs) vastly more practical and cost-effective for real-world applications.