Research & Papers

LookWhen slashes video recognition cost 6.7x while matching accuracy

New selector-extractor framework skips redundant video tokens for 6.7x speedup.

Deep Dive

Transformers dominate video recognition, but their compute cost grows superlinearly with video length: self-attention scales quadratically with token count, and standard models process every token of every frame. LookWhen, a new framework from Salamatian, Fuller, Sarkar, Green, Sigal, and Shelhamer, tackles this by factorizing video understanding into when, where, and what to compute. A lightweight selector runs on a downscaled copy of the video and scores all tokens across space-time, then only the top-K tokens are passed to a deeper extractor. The result approximates the full-video representation without processing every token, which cuts FLOPs substantially.
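
The select-then-extract step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the scores are made-up numbers standing in for the selector's output, and `select_top_k`, `gather`, and `attention_pairs` are hypothetical helper names. The pair count only illustrates the quadratic attention term, not total model FLOPs.

```python
import heapq

def select_top_k(scores, k):
    """Return the indices of the k highest-scoring tokens, in original order."""
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and the number of tokens")
    top = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    return sorted(top)  # keep space-time order for the extractor

def gather(tokens, indices):
    """Keep only the selected tokens."""
    return [tokens[i] for i in indices]

def attention_pairs(num_tokens):
    """Number of token-to-token comparisons in one self-attention layer."""
    return num_tokens * num_tokens

# Hypothetical selector scores for 8 space-time tokens (illustrative values).
scores = [0.10, 0.90, 0.05, 0.70, 0.20, 0.80, 0.15, 0.30]
tokens = [f"tok{i}" for i in range(len(scores))]

kept = select_top_k(scores, k=3)
selected = gather(tokens, kept)

print(kept)                            # [1, 3, 5]
print(selected)                        # ['tok1', 'tok3', 'tok5']
print(attention_pairs(len(tokens)))    # 64
print(attention_pairs(len(selected)))  # 9
```

Keeping 3 of 8 tokens shrinks the attention comparisons from 64 to 9 in this toy case, which is why dropping tokens before the deep extractor pays off more than linearly.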

To train the selector, the authors introduce a uniqueness score based on nearest-neighbor distance among token representations. For the extractor, they distill knowledge from both a video teacher and an image teacher, normalizing frame-wise representations to learn temporal changes. Evaluated on six datasets (Kinetics-400, SSv2, Epic-Kitchens, Diving48, Jester, Charades), LookWhen Pareto-dominates on the accuracy-FLOPs trade-off in 9 of 12 cases and roughly matches in the remaining 3. In measured real-world throughput, it is 6.7x faster than InternVideo2-B at the same accuracy. These results suggest a path toward practical, real-time video recognition on edge devices.
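
One way to read a nearest-neighbor uniqueness score is sketched below. The summary does not give the exact formulation, so this sketch makes several assumptions: Euclidean distance, a brute-force search over a single pool of tokens, and the raw distance used directly as the score. The function name `uniqueness_scores` and the 2-D feature vectors are hypothetical.

```python
import math

def uniqueness_scores(features):
    """Score each token by the distance to its nearest other token.

    A token with a near-duplicate elsewhere in the video scores close to 0;
    a token unlike every other token scores high.
    Assumption: Euclidean distance, brute-force O(n^2) search.
    """
    if len(features) < 2:
        raise ValueError("need at least two tokens to compare")
    scores = []
    for i, f in enumerate(features):
        nearest = min(
            math.dist(f, g) for j, g in enumerate(features) if j != i
        )
        scores.append(nearest)
    return scores

# Hypothetical 2-D token features; tokens 0 and 2 are exact duplicates.
features = [(0.0, 0.0), (0.0, 1.0), (0.0, 0.0), (3.0, 5.0)]
scores = uniqueness_scores(features)
print(scores)  # [0.0, 1.0, 0.0, 5.0]
```

Under these assumptions the duplicated tokens score 0 and the outlier scores highest, so a selector trained against such a target would learn to skip redundant content.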

Key Points
  • LookWhen's shallow selector scores tokens from a scaled-down video, then the deep extractor processes only the top-K tokens, reducing computation.
  • In real-world throughput, LookWhen is 6.7x faster than InternVideo2-B at equal accuracy, and it Pareto-dominates on the accuracy-FLOPs trade-off in 9 of 12 benchmark settings.
  • Pre-training pairs a nearest-neighbor uniqueness score, which supervises token selection, with distillation from video and image teachers, which teaches the extractor temporal changes.
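
For readers unfamiliar with the term, Pareto dominance on accuracy versus FLOPs can be checked as follows. This is a sketch of the standard definition only: the `Result` type, the `dominates` helper, and every number below are hypothetical and are not results from the paper.

```python
from typing import NamedTuple

class Result(NamedTuple):
    accuracy: float  # higher is better
    gflops: float    # lower is better

def dominates(a: Result, b: Result) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.gflops <= b.gflops
    strictly_better = a.accuracy > b.accuracy or a.gflops < b.gflops
    return no_worse and strictly_better

# Made-up operating points, for illustration only.
model_a = Result(accuracy=80.0, gflops=100.0)
model_b = Result(accuracy=79.0, gflops=150.0)
model_c = Result(accuracy=82.0, gflops=200.0)

print(dominates(model_a, model_b))  # True: more accurate and cheaper
print(dominates(model_a, model_c))  # False: cheaper but less accurate
print(dominates(model_c, model_a))  # False: more accurate but costlier
```

When neither model dominates the other, as with `model_a` and `model_c`, the two sit at different points on the trade-off curve rather than one being strictly better.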

Why It Matters

By matching InternVideo2-B's accuracy at 6.7x the throughput, LookWhen could make real-time video recognition practical on resource-constrained devices.