Fre-Res compresses video tokens without sacrificing accuracy in MLLMs
New frequency-domain technique cuts video tokens while preserving spatial and temporal details
Video multimodal large language models (MLLMs) face a fundamental trade-off: capturing fine-grained spatial detail requires many tokens per frame, while tracking brief events demands dense temporal sampling. Together, the two requirements inflate the visual-token budget. Fre-Res, developed by researchers at the National University of Defense Technology and South China University of Technology, addresses this by separating the two information streams. It keeps a sparse set of high-fidelity spatial anchors (for object and layout details) and compresses the dense temporal evolution between them into compact 'residual-frequency' tokens. The framework applies a temporal 1D-DCT (discrete cosine transform) to inter-frame residual trajectories in the vision-latent space, where the authors observe strong low-frequency concentration. In other words, most of the temporal change can be captured by a small number of low-frequency coefficients.
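To make the temporal-DCT step concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes that "inter-frame residuals" means differences between consecutive frame latents, and the names `dct_matrix`, `temporal_dct_of_residuals`, and `low_frequency_energy`, the tensor shapes, and the toy data are all hypothetical choices made for illustration.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II matrix of shape (n, n); row k is the k-th frequency."""
    k = np.arange(n)[:, None]  # frequency index
    t = np.arange(n)[None, :]  # time index
    basis = np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    basis[0] /= np.sqrt(2.0)  # rescale the DC row so the matrix is orthonormal
    return basis * np.sqrt(2.0 / n)


def temporal_dct_of_residuals(latents: np.ndarray) -> np.ndarray:
    """Apply a 1D-DCT along time to inter-frame residuals.

    latents: (T, N, D) array of per-frame latents (T frames, N tokens, D channels).
    Returns (T - 1, N, D) coefficients, ordered from low to high frequency.
    Assumption: residuals are consecutive-frame differences.
    """
    residuals = np.diff(latents, axis=0)                # (T-1, N, D)
    basis = dct_matrix(residuals.shape[0])              # (T-1, T-1)
    return np.tensordot(basis, residuals, axes=(1, 0))  # (T-1, N, D)


def low_frequency_energy(coeffs: np.ndarray, keep: int) -> float:
    """Fraction of total energy held by the first `keep` temporal frequencies."""
    return float(np.sum(coeffs[:keep] ** 2) / np.sum(coeffs ** 2))


# Toy data (illustrative only): latents that drift smoothly over time.
rng = np.random.default_rng(0)
num_frames, num_tokens, dim = 17, 4, 8
base = rng.normal(size=(num_tokens, dim))
direction = rng.normal(size=(num_tokens, dim))
progress = (np.arange(num_frames) / num_frames) ** 2
latents = base + progress[:, None, None] * direction    # (17, 4, 8)

coeffs = temporal_dct_of_residuals(latents)
residual_tokens = coeffs[:4]   # keep only the 4 lowest temporal frequencies
print(residual_tokens.shape)   # (4, 4, 8)
print(low_frequency_energy(coeffs, keep=4))
```

On smooth motion like this toy drift, nearly all of the residual energy lands in the first few coefficients, which is the kind of low-frequency concentration the article describes. Real video latents will be noisier, so how many coefficients to keep is an empirical choice.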
To bridge frequency-domain dynamics with standard visual embeddings, Fre-Res introduces a Spatial-Guided Absorber that injects temporal residual information directly into the corresponding spatial anchor tokens. Evaluated on fine-grained short- and long-video reasoning benchmarks (e.g., activity recognition, event understanding), Fre-Res achieves accuracy comparable to full-token models with substantially fewer visual tokens. Ablation studies indicate that the temporal-frequency residuals preserve causal transition cues (e.g., object movement, scene changes), while the spatial anchors remain essential for reasoning about static detail. This makes Fre-Res a practical option for deploying video MLLMs in resource-constrained environments with little loss in comprehension quality.
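The article does not spell out how the Spatial-Guided Absorber works internally, so the following is only one plausible reading, sketched in NumPy: each anchor token receives a projected copy of the frequency coefficients at its own spatial position, scaled by a gate computed from the anchor itself. The function `absorb`, the weights `w_proj` and `w_gate`, and the gated-addition design are all assumptions, not the published architecture.

```python
import numpy as np


def absorb(anchors: np.ndarray, freq_tokens: np.ndarray,
           w_proj: np.ndarray, w_gate: np.ndarray) -> np.ndarray:
    """Hypothetical gated injection of temporal-frequency info into anchors.

    anchors:     (N, D) spatial anchor tokens.
    freq_tokens: (K, N, D) low-frequency temporal coefficients per position.
    w_proj:      (K * D, D) projection from stacked coefficients to token space.
    w_gate:      (D, D) weights for a gate driven by the anchor content.
    Returns (N, D): the token count is unchanged, only the content is enriched.
    """
    n, d = anchors.shape
    k = freq_tokens.shape[0]
    # Stack the K coefficients that belong to each spatial position.
    per_position = freq_tokens.transpose(1, 0, 2).reshape(n, k * d)  # (N, K*D)
    update = per_position @ w_proj                                   # (N, D)
    # Sigmoid gate computed from the anchor (the "spatial guidance" assumed here).
    gate = 1.0 / (1.0 + np.exp(-(anchors @ w_gate)))                 # (N, D)
    return anchors + gate * update


# Toy shapes and random weights, for illustration only.
rng = np.random.default_rng(0)
num_tokens, dim, keep = 4, 8, 3
anchors = rng.normal(size=(num_tokens, dim))
freq_tokens = rng.normal(size=(keep, num_tokens, dim))
w_proj = rng.normal(size=(keep * dim, dim)) * 0.1
w_gate = rng.normal(size=(dim, dim)) * 0.1

fused = absorb(anchors, freq_tokens, w_proj, w_gate)
print(fused.shape)  # (4, 8)
```

The point of the sketch is the interface rather than the exact arithmetic: the temporal signal is folded into the existing anchor tokens, so the language model sees the same number of visual tokens as it would with anchors alone. In a trained system the weights would be learned, and the actual module may use attention or another mechanism.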
- Fre-Res uses a dual-track approach: sparse spatial anchors for layout + compact frequency-domain tokens for temporal dynamics.
- Achieves accuracy comparable to full-token models on multiple video reasoning benchmarks while substantially reducing visual-token length.
- Introduces a Spatial-Guided Absorber to align frequency-domain temporal information with native visual embeddings.
Why It Matters
Reduces the visual-token overhead of video MLLMs, which could make real-time video analysis feasible on limited hardware.