ReDiPrune: Relevance-Diversity Pre-Projection Token Pruning for Efficient Multimodal LLMs
Training-free method cuts multimodal LLM compute by more than 6x while improving accuracy on video and image tasks.
A research team including An Yu, Ting Yu Tsai, and Ming-Ching Chang has introduced ReDiPrune (Relevance-Diversity Pre-Projection Token Pruning), a method that dramatically improves the efficiency of multimodal large language models (MLLMs) such as LLaVA-NeXT. The core innovation is pruning visual tokens before they pass through the vision-language projector, at a stage where the visual features are still rich and discriminative. Unlike post-projection methods that operate on compressed representations, ReDiPrune works directly on the vision encoder's outputs, preserving crucial spatial and semantic information that would otherwise be lost.
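To make the placement concrete, here is a minimal sketch of the pre-projection visual path, assuming PyTorch; `vision_encoder`, `projector`, and the norm-based placeholder criterion are illustrative stand-ins, not the authors' implementation:

```python
import torch
import torch.nn as nn

# Illustrative stand-ins for a CLIP-style encoder and an MLP projector
vision_encoder = nn.Linear(768, 1024)  # patch features -> visual tokens
projector = nn.Linear(1024, 4096)      # vision width -> LLM embedding width

def visual_path(patches: torch.Tensor, keep_ratio: float = 0.15) -> torch.Tensor:
    vis = vision_encoder(patches)              # [N, 1024] rich encoder features
    k = max(1, int(keep_ratio * vis.shape[0]))
    # Placeholder criterion (token norm); ReDiPrune instead scores
    # text-conditioned relevance plus max-min diversity (next sketch)
    idx = vis.norm(dim=-1).topk(k).indices
    return projector(vis[idx])                 # only k tokens reach the LLM

out = visual_path(torch.randn(2880, 768))      # e.g. 2880 video patch tokens
print(out.shape)                               # torch.Size([432, 4096])
```

Because pruning happens before the projector, the projection itself and every downstream LLM layer only ever see the surviving k tokens.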
ReDiPrune uses a lightweight scoring rule that evaluates each token on both text-conditioned relevance and max-min diversity, so the selected tokens are query-relevant and non-redundant. The approach is training-free and fully plug-and-play, requiring no retraining or architectural modifications: it simply slots in between the encoder and the projector. The team validated the method across four video and five image benchmarks, consistently improving the accuracy-efficiency trade-off.
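A minimal sketch of one plausible instantiation of that scoring rule, assuming cosine similarity to a pooled query embedding for the relevance term and a greedy farthest-point update for the max-min diversity term; the paper's exact formulation may differ:

```python
import torch
import torch.nn.functional as F

def relevance_diversity_select(vis, text, k, alpha=0.5):
    """Greedily pick k of N pre-projection visual tokens.

    vis:   [N, D] visual tokens from the vision encoder
    text:  [D]    pooled text/query embedding (assumed available)
    alpha: trade-off between relevance (1.0) and diversity (0.0)
    """
    v = F.normalize(vis, dim=-1)
    relevance = v @ F.normalize(text, dim=-1)   # [N] text-conditioned relevance
    # Max-min diversity: distance from each token to its nearest selected token
    min_dist = torch.full_like(relevance, float("inf"))
    selected = []
    for _ in range(k):
        # Cap the initial +inf so the first pick is driven purely by relevance
        score = alpha * relevance + (1 - alpha) * min_dist.clamp(max=2.0)
        if selected:
            score[torch.tensor(selected)] = -float("inf")  # never re-pick
        i = int(score.argmax())
        selected.append(i)
        dist = 1.0 - v @ v[i]                   # cosine distance to the new pick
        min_dist = torch.minimum(min_dist, dist)
    return vis[torch.tensor(selected)]          # [k, D], in selection order
```

The max-min update keeps every new pick far from everything already chosen, which is what stops the pruned set from collapsing onto a few highly relevant but near-duplicate regions.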
The results are particularly striking for video understanding tasks. When applied to LLaVA-NeXT-Video-7B on the EgoSchema benchmark, ReDiPrune achieved a +2.0% absolute accuracy gain while retaining only 15% of visual tokens and reducing computational requirements by more than 6× in TFLOPs. This represents a rare win-win scenario where both performance and efficiency improve simultaneously, challenging the conventional wisdom that pruning necessarily sacrifices accuracy. The method's code is already available, offering immediate practical benefits for developers working with resource-intensive multimodal models.
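As a back-of-envelope sanity check (our arithmetic, not a figure from the paper): LLM compute is roughly linear in sequence length for the MLP and projection layers, which dominate FLOPs at these lengths, so when visual tokens make up most of the input, keeping 15% of them cuts compute by about 1/0.15 ≈ 6.7x, consistent with the reported >6x TFLOPs reduction.

```python
keep = 0.15
print(f"linear-cost reduction:    {1 / keep:.1f}x")     # ~6.7x (MLP, projector)
print(f"quadratic-cost reduction: {1 / keep**2:.1f}x")  # ~44.4x (attention scores)
```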
- Cuts multimodal LLM computation by more than 6x while improving accuracy by 2.0 points on the EgoSchema benchmark
- Plug-and-play design requires no retraining; it inserts between the vision encoder and the vision-language projector
- Keeps only 15% of visual tokens, chosen by joint relevance-diversity scoring to retain informative, non-redundant visual content
Why It Matters
Cutting compute by more than 6x while improving accuracy brings real-time video analysis closer to consumer hardware, making advanced multimodal AI more accessible.