Research & Papers

New video tokenizer drops redundant tokens for 31x speedup

31x faster than ElasticTok-CV with no auxiliary networks needed.

Deep Dive

A new paper on arXiv proposes an efficient approach to adaptive video tokenisation that removes most of the computational overhead of prior methods. Current continuous-regime techniques rely on iterative binary searches or trained neural regressors to allocate token budgets, while discrete methods often require a full-rate decoder pass. The authors show that the latent space of a frozen continuous video tokeniser already encodes temporal redundancy. By applying a fixed threshold to per-position temporal L1 differences between latent representations, they identify and drop spatial positions whose latent vectors barely change between consecutive frames, since these carry almost no additional information. The result is a content-driven compression rate: static scenes compress aggressively, while dynamic sequences retain more tokens. No auxiliary routing networks or iterative searches are needed: the method uses a single encoder pass followed by one forward pass through the Latent Inpainting Transformer (LIT), a lightweight factorised spatial-temporal attention architecture that reconstructs the dropped positions.
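The thresholding step described above can be illustrated with a minimal sketch. This is not the authors' code: the function names, the choice to average the absolute difference over channels, the comparison against the immediately preceding frame, and the rule that the first frame is always kept are all assumptions made for illustration. Latents are represented as plain nested lists indexed as latents[frame][position][channel].

```python
from typing import List

Latent = List[float]   # one latent vector (channels) at a spatial position
Frame = List[Latent]   # all spatial positions of one latent frame, flattened


def temporal_l1_keep_mask(latents: List[Frame], threshold: float) -> List[List[bool]]:
    """Return keep[t][p]: True if position p of frame t is kept, False if dropped.

    Hypothetical helper: a position is dropped when the mean absolute
    difference to the same position in the previous frame is at or below
    the fixed threshold.
    """
    if not latents:
        return []
    # Assumption: the first frame has no predecessor, so all its positions are kept.
    keep = [[True] * len(latents[0])]
    for prev, curr in zip(latents, latents[1:]):
        row = []
        for v_prev, v_curr in zip(prev, curr):
            # Mean absolute change across channels at this position.
            diff = sum(abs(a - b) for a, b in zip(v_curr, v_prev)) / len(v_curr)
            row.append(diff > threshold)
        keep.append(row)
    return keep


def kept_fraction(keep: List[List[bool]]) -> float:
    """Fraction of positions retained across the whole clip."""
    total = sum(len(row) for row in keep)
    return sum(sum(row) for row in keep) / total if total else 0.0


# A static clip: three identical frames with two positions and two channels.
static_clip = [[[0.0, 0.0], [1.0, 1.0]] for _ in range(3)]
# A dynamic clip: every position changes from frame to frame.
dynamic_clip = [[[float(t), float(t)], [float(t), 0.0]] for t in range(3)]

static_keep = temporal_l1_keep_mask(static_clip, threshold=0.1)
dynamic_keep = temporal_l1_keep_mask(dynamic_clip, threshold=0.1)
print(kept_fraction(static_keep), kept_fraction(dynamic_keep))
```

In this toy setup the static clip keeps only the positions of its first frame, while the dynamic clip keeps every position, which mirrors the content-driven behaviour described in the article. The threshold value of 0.1 is arbitrary here; the article does not state the value used in the paper.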

Evaluated on TokenBench and DAVIS, the framework achieves content-driven token allocation with competitive reconstruction fidelity. Crucially, it delivers a 31x inference-time speedup over the continuous adaptive baseline ElasticTok-CV and a 2x speedup over the discrete information-theoretic baseline InfoTok. The token-allocation step is parameter-free and the base tokeniser does not need retraining, which makes the method straightforward to integrate into existing video processing pipelines. This work could significantly reduce the compute cost of video understanding, compression, and generation tasks, especially for applications dealing with long or static-heavy footage.

Key Points
  • Parameter-free adaptive token allocation using a fixed threshold on temporal L1 differences in latent space.
  • Latent Inpainting Transformer (LIT) reconstructs dropped positions with a lightweight factorised spatial-temporal attention block.
  • 31x faster than ElasticTok-CV and 2x faster than InfoTok on standard benchmarks (TokenBench, DAVIS).

Why It Matters

Adaptive video tokenisation that runs 31x faster than ElasticTok-CV at inference time, lowering compute cost for video AI on long or largely static footage.
