ReDiPrune: Relevance-Diversity Pre-Projection Token Pruning for Efficient Multimodal LLMs
Training-free method cuts multimodal LLM compute by more than 6x while improving accuracy on video and image tasks.
A research team including An Yu, Ting Yu Tsai, and Ming-Ching Chang has introduced ReDiPrune (Relevance-Diversity Pre-Projection Token Pruning), a method that dramatically improves the efficiency of multimodal large language models (MLLMs) such as LLaVA-NeXT. The core innovation is pruning visual tokens before they pass through the vision-language projector, at a stage where the visual features are still rich and discriminative. Unlike post-projection methods that operate on compressed representations, ReDiPrune works directly on the vision encoder's outputs, preserving crucial spatial and semantic information that would otherwise be lost.
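To make the placement concrete, here is a minimal sketch of the pre-projection visual path, assuming PyTorch; `vision_encoder`, `projector`, and the norm-based placeholder criterion are illustrative stand-ins, not the authors' implementation:

```python
import torch
import torch.nn as nn

# Illustrative stand-ins for a CLIP-style encoder and an MLP projector
vision_encoder = nn.Linear(768, 1024)  # patch features -> visual tokens
projector = nn.Linear(1024, 4096)      # vision width -> LLM embedding width

def visual_path(patches: torch.Tensor, keep_ratio: float = 0.15) -> torch.Tensor:
    vis = vision_encoder(patches)              # [N, 1024] rich encoder features
    k = max(1, int(keep_ratio * vis.shape[0]))
    # Placeholder criterion (token norm); ReDiPrune instead scores
    # text-conditioned relevance plus max-min diversity (next sketch)
    idx = vis.norm(dim=-1).topk(k).indices
    return projector(vis[idx])                 # only k tokens reach the LLM

out = visual_path(torch.randn(2880, 768))      # e.g. 2880 video patch tokens
print(out.shape)                               # torch.Size([432, 4096])
```

Because pruning happens before the projector, the projection itself and every downstream LLM layer only ever see the surviving k tokens.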
ReDiPrune uses a lightweight scoring rule that evaluates each token on both text-conditioned relevance and max-min diversity, so the selected tokens are query-relevant and non-redundant. The approach is training-free and fully plug-and-play, requiring no retraining or architectural modifications: it simply slots in between the encoder and the projector. The team validated the method across four video and five image benchmarks, consistently improving the accuracy-efficiency trade-off.
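A minimal sketch of one plausible instantiation of that scoring rule, assuming cosine similarity to a pooled query embedding for the relevance term and a greedy farthest-point update for the max-min diversity term; the paper's exact formulation may differ:

```python
import torch
import torch.nn.functional as F

def relevance_diversity_select(vis, text, k, alpha=0.5):
    """Greedily pick k of N pre-projection visual tokens.

    vis:   [N, D] visual tokens from the vision encoder
    text:  [D]    pooled text/query embedding (assumed available)
    alpha: trade-off between relevance (1.0) and diversity (0.0)
    """
    v = F.normalize(vis, dim=-1)
    relevance = v @ F.normalize(text, dim=-1)   # [N] text-conditioned relevance
    # Max-min diversity: distance from each token to its nearest selected token
    min_dist = torch.full_like(relevance, float("inf"))
    selected = []
    for _ in range(k):
        # Cap the initial +inf so the first pick is driven purely by relevance
        score = alpha * relevance + (1 - alpha) * min_dist.clamp(max=2.0)
        if selected:
            score[torch.tensor(selected)] = -float("inf")  # never re-pick
        i = int(score.argmax())
        selected.append(i)
        dist = 1.0 - v @ v[i]                   # cosine distance to the new pick
        min_dist = torch.minimum(min_dist, dist)
    return vis[torch.tensor(selected)]          # [k, D], in selection order
```

The max-min update keeps every new pick far from everything already chosen, which is what stops the pruned set from collapsing onto a few highly relevant but near-duplicate regions.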
The results are particularly striking for video understanding tasks. When applied to LLaVA-NeXT-Video-7B on the EgoSchema benchmark, ReDiPrune achieved a +2.0% absolute accuracy gain while retaining only 15% of visual tokens and reducing computational requirements by more than 6× in TFLOPs. This represents a rare win-win scenario where both performance and efficiency improve simultaneously, challenging the conventional wisdom that pruning necessarily sacrifices accuracy. The method's code is already available, offering immediate practical benefits for developers working with resource-intensive multimodal models.
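As a back-of-envelope sanity check (our arithmetic, not a figure from the paper): LLM compute is roughly linear in sequence length for the MLP and projection layers, which dominate FLOPs at these lengths, so when visual tokens make up most of the input, keeping 15% of them cuts compute by about 1/0.15 ≈ 6.7x, consistent with the reported >6x TFLOPs reduction.

```python
keep = 0.15
print(f"linear-cost reduction:    {1 / keep:.1f}x")     # ~6.7x (MLP, projector)
print(f"quadratic-cost reduction: {1 / keep**2:.1f}x")  # ~44.4x (attention scores)
```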
- Cuts multimodal LLM computation by more than 6x while improving accuracy by 2.0 points on the EgoSchema benchmark
- Plug-and-play design requires no retraining; it inserts between the vision encoder and the vision-language projector
- Keeps only 15% of visual tokens, chosen by joint relevance-diversity scoring to retain informative, non-redundant visual content
Why It Matters
Cutting compute by more than 6x while improving accuracy brings real-time video analysis closer to consumer hardware, making advanced multimodal AI more accessible.