Research & Papers

OTPrune: Distribution-Aligned Visual Token Pruning via Optimal Transport

A new training-free method cuts MLLM inference costs by aligning token distributions, preserving accuracy while removing redundancy.

Deep Dive

A research team led by Xiwen Chen has introduced OTPrune, a framework that tackles the high computational cost of multimodal large language models (MLLMs) such as GPT-4V and LLaVA. These models, which combine vision and language understanding, process images by splitting them into hundreds of visual tokens, many of which are redundant. OTPrune formulates token pruning as a distribution-alignment problem in optimal transport (OT): it selects the subset that minimizes the 2-Wasserstein distance between the full and pruned token distributions. By matching distributions rather than scoring tokens individually, the pruned subset retains both the local diversity and the global representativeness of the original visual information, yielding more stable and semantically faithful compression than earlier heuristic methods.
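To make the alignment criterion concrete, here is a minimal sketch of an entropic (Sinkhorn) approximation to the 2-Wasserstein distance between a full token set and a pruned subset. The function name `sinkhorn_w2`, the uniform token weights, and the entropic regularization are illustrative assumptions for this sketch, not details taken from the paper.

```python
import numpy as np

def sinkhorn_w2(X, Y, eps=0.1, n_iters=200):
    """Entropic (Sinkhorn) approximation of the 2-Wasserstein distance
    between two empirical token distributions with uniform weights."""
    # Squared Euclidean cost between every full-set / pruned-set token pair.
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    scale = C.max()                    # normalize costs for numerical stability
    K = np.exp(-(C / scale) / eps)     # Gibbs kernel
    a = np.full(len(X), 1.0 / len(X))  # uniform weights on the full token set
    b = np.full(len(Y), 1.0 / len(Y))  # uniform weights on the pruned subset
    u = np.ones_like(a)
    for _ in range(n_iters):           # alternating Sinkhorn scaling updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]    # approximate optimal transport plan
    return np.sqrt((P * C).sum())      # transport cost at the original scale
```

A subset whose tokens cover the same regions of embedding space as the full set scores a small distance; a subset that drifts away from the original distribution scores a large one, which is the intuition behind the alignment objective.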

The key innovation is a tractable, submodular optimization objective derived from this OT formulation, which enables efficient, training-free pruning. The team provides theoretical guarantees of the objective's monotonicity and submodularity, offering a principled foundation absent in prior work. Comprehensive experiments across standard benchmarks show OTPrune achieves superior performance-efficiency trade-offs, significantly accelerating inference while minimizing accuracy loss compared to state-of-the-art methods. Accepted to CVPR 2026, this work provides a robust, theoretically grounded tool for deploying efficient vision-language AI in real-world applications, from robotics to content analysis, where speed and cost are critical.
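Monotone submodular objectives matter because they admit a simple greedy selection rule with a provable approximation guarantee. The sketch below runs greedy selection on a facility-location objective, a standard monotone submodular surrogate for representativeness; the function `greedy_prune` and the cosine-similarity objective are assumptions for illustration, not the paper's derived objective.

```python
import numpy as np

def greedy_prune(tokens, k):
    """Greedily pick k tokens maximizing a facility-location objective
    F(S) = sum_i max(0, max_{j in S} cos(i, j)): a monotone submodular
    surrogate for representativeness (not the paper's exact objective)."""
    Z = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = Z @ Z.T          # pairwise cosine similarity between tokens
    n = len(tokens)
    selected = []
    cover = np.zeros(n)    # best similarity of each token to the kept set
    for _ in range(k):
        best_j, best_gain = -1, -np.inf
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain of adding token j: how much coverage improves.
            gain = np.maximum(sim[:, j], cover).sum() - cover.sum()
            if gain > best_gain:
                best_gain, best_j = gain, j
        selected.append(best_j)
        cover = np.maximum(cover, sim[:, best_j])
    return sorted(selected)
```

For monotone submodular objectives, greedy selection carries the classic (1 - 1/e) approximation guarantee, which is why proving monotonicity and submodularity gives the method a principled footing rather than a purely heuristic one.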

Key Points
  • Uses Optimal Transport theory to align token distributions, preserving semantic information during pruning
  • Achieves superior performance-efficiency trade-offs on benchmarks, accelerating MLLM inference without retraining
  • Provides theoretical guarantees (monotonicity, submodularity) for stable and efficient optimization, accepted to CVPR 2026

Why It Matters

Enables faster, cheaper deployment of vision-language AI in applications like autonomous systems and real-time content analysis.