Research & Papers

RCP: Representation Consistency Pruner for Mitigating Distribution Shift in Large Vision-Language Models

New pruning method removes up to 88.9% of visual tokens, slashing FLOPs by up to 85.7% while maintaining performance.

Deep Dive

A research team led by Jianwei Zhang and Pengcheng Zheng has introduced RCP (Representation Consistency Pruner), a new framework designed to dramatically reduce the computational cost of Large Vision-Language Models (LVLMs) such as GPT-4V or LLaVA. These models suffer from prohibitive inference costs because they process massive numbers of visual tokens through their language decoders. Existing pruning methods often cause significant performance degradation due to distribution shift: removing tokens pushes the model's internal representations away from the distribution they were trained on. RCP addresses this with two key innovations: a cross-attention pruner that uses the model's own attention patterns to select which tokens to remove, keeping the selection consistent across layers, and a Delayed Repair Adapter (DRA) that caches a summary of the pruned information and reinjects it during answer generation.
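To make the selection step concrete, here is a minimal PyTorch sketch of attention-guided visual token pruning. It is an illustration under assumed tensor shapes: the function `prune_visual_tokens`, the scoring rule (the average attention that text queries pay each visual token), and the keep ratio are assumptions consistent with the description above, not the authors' implementation.

```python
import torch

def prune_visual_tokens(hidden, attn, visual_idx, keep_ratio=0.111):
    """Keep the visual tokens that text tokens attend to most.

    hidden:     (batch, seq_len, dim) decoder hidden states
    attn:       (batch, heads, seq_len, seq_len) attention weights
    visual_idx: (num_visual,) positions of visual tokens in the sequence
    keep_ratio: fraction of visual tokens retained (0.111 ~ 1 - 0.889)
    """
    batch, seq_len = hidden.shape[0], hidden.shape[1]
    text_mask = torch.ones(seq_len, dtype=torch.bool, device=hidden.device)
    text_mask[visual_idx] = False  # queries = non-visual (text) positions

    # Score each visual token by the attention it receives from text queries,
    # averaged over heads and query positions.
    scores = attn[:, :, text_mask][:, :, :, visual_idx].mean(dim=(1, 2))

    num_keep = max(1, int(keep_ratio * visual_idx.numel()))
    top = scores.topk(num_keep, dim=-1).indices   # (batch, num_keep)
    keep_visual = visual_idx[top]                 # kept visual positions

    # Physically rebuild a shorter sequence (all text tokens plus the kept
    # visual tokens, in original order) so later layers process fewer tokens.
    text_idx = torch.arange(seq_len, device=hidden.device)[text_mask]
    kept = torch.cat([text_idx.expand(batch, -1), keep_visual], dim=-1)
    kept = kept.sort(dim=-1).values
    return torch.gather(hidden, 1, kept.unsqueeze(-1).expand(-1, -1, hidden.size(-1)))
```

RCP reportedly keeps this selection consistent across layers; the sketch prunes once, at a single layer, for simplicity.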

RCP's efficiency comes from training only lightweight plug-in modules while physically discarding tokens at inference time. The Delayed Repair Adapter applies FiLM-based modulation specifically to answer-generation tokens and uses a repair loss that matches the first- and second-order statistics of the pruned representations to those of a full-token teacher model. Extensive experiments show RCP removes up to 88.9% of visual tokens and reduces FLOPs by up to 85.7%, with only a marginal average accuracy drop across LVLM benchmarks. The method outperforms prior state-of-the-art approaches that avoid fine-tuning the original model, making it a practical option for deploying efficient vision-language AI in resource-constrained environments.
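The two mechanisms named above lend themselves to a compact sketch. Everything below is a hedged illustration: the module name `DelayedRepairAdapter`, the mean-pooled summary of pruned tokens used as the conditioning signal, and the exact form of `repair_loss` are assumptions consistent with the description, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DelayedRepairAdapter(nn.Module):
    """FiLM-style modulation: scale and shift answer-token states using
    parameters predicted from a cached summary of the pruned visual tokens."""
    def __init__(self, dim):
        super().__init__()
        self.to_gamma = nn.Linear(dim, dim)  # per-channel scale
        self.to_beta = nn.Linear(dim, dim)   # per-channel shift
        # Zero init makes the adapter start as an identity map, a common
        # choice for lightweight plug-in modules (an assumption here).
        for layer in (self.to_gamma, self.to_beta):
            nn.init.zeros_(layer.weight)
            nn.init.zeros_(layer.bias)

    def forward(self, answer_states, pruned_summary):
        # answer_states:  (batch, num_answer_tokens, dim)
        # pruned_summary: (batch, dim), e.g. the mean of discarded token states
        gamma = self.to_gamma(pruned_summary).unsqueeze(1)
        beta = self.to_beta(pruned_summary).unsqueeze(1)
        return (1 + gamma) * answer_states + beta

def repair_loss(student_states, teacher_states):
    """Match the mean (first-order) and variance (second-order) of the pruned
    model's answer states to those of a frozen full-token teacher."""
    mu_s, mu_t = student_states.mean(dim=1), teacher_states.mean(dim=1)
    var_s, var_t = student_states.var(dim=1), teacher_states.var(dim=1)
    return F.mse_loss(mu_s, mu_t) + F.mse_loss(var_s, var_t)
```

Because only these small modules receive gradients, the base LVLM stays frozen, which matches the claim that RCP avoids fine-tuning the original model.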

Key Points
  • Removes up to 88.9% of visual tokens and reduces FLOPs by up to 85.7% with minimal accuracy loss (a rough FLOPs estimate follows this list)
  • Uses a cross-attention pruner and Delayed Repair Adapter (DRA) with FiLM-based modulation
  • Trains only lightweight plug-in modules, enabling physical token discarding at inference so the compute savings are realized at runtime
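
As a sanity check on the headline numbers, a back-of-the-envelope estimate shows why dropping most visual tokens cuts decoder FLOPs so sharply. The token counts (576 visual tokens, typical of LLaVA-style models, plus an assumed 64 text tokens) and the per-layer cost model are assumptions; the reported 85.7% will additionally depend on where pruning happens and on each benchmark's prompts.

```python
# Rough per-layer decoder cost: a*n*d^2 for the linear layers plus
# b*n^2*d for attention, with textbook transformer constants.
d, a, b = 4096, 24, 4

def flops(n):
    return a * n * d**2 + b * n**2 * d

visual, text = 576, 64                     # assumed sequence composition
kept_visual = round(visual * (1 - 0.889))  # 88.9% of visual tokens removed
full_len, pruned_len = visual + text, kept_visual + text

print(f"FLOPs reduction ~ {1 - flops(pruned_len) / flops(full_len):.1%}")
# ~80% under these assumptions, in the ballpark of the reported 85.7%
```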

Why It Matters

Enables efficient deployment of powerful vision-language AI on edge devices and in cost-sensitive applications by drastically cutting compute requirements.