Mix-and-Match Pruning: Globally Guided Layer-Wise Sparsification of DNNs
New framework generates 10 pruning strategies per model, cutting Swin-Tiny's accuracy degradation by 40%.
A team of researchers from multiple institutions has published a new paper on arXiv titled "Mix-and-Match Pruning: Globally Guided Layer-Wise Sparsification of DNNs." The framework addresses a core challenge in deploying large AI models: compressing them for edge devices without sacrificing performance. Current pruning methods often apply a single strategy uniformly across all layers, which is suboptimal because different layers and architectures (CNNs versus Vision Transformers, for example) respond differently to sparsification. Mix-and-Match Pruning introduces a smarter, coordinated approach.
It works by first deriving architecture-aware sparsity ranges, for example preserving crucial normalization layers while pruning classifier layers more aggressively. The system then leverages existing sensitivity signals, such as weight magnitude or gradient information, to systematically sample these ranges and generate ten distinct, high-quality pruning strategies per signal. This eliminates the need for costly, repeated pruning experiments. In tests on models like Swin-Tiny, the framework demonstrated Pareto-optimal efficiency, reducing accuracy degradation by 40% relative to standard single-criterion pruning. The findings suggest that better coordination of existing pruning signals outperforms inventing new, complex criteria, offering a reliable path to smaller, faster models ready for real-world deployment.
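To make the two-step idea concrete, the sketch below samples layer-wise sparsity levels from architecture-aware ranges and applies magnitude pruning per layer. Everything here is an illustrative assumption, not the paper's implementation: the range table, the uniform-sampling scheme, and the names `SPARSITY_RANGES`, `generate_strategies`, and `magnitude_prune` are hypothetical.

```python
import numpy as np

# Hypothetical architecture-aware (min, max) sparsity ranges per layer type.
# Normalization layers stay dense; classifier layers are pruned aggressively.
SPARSITY_RANGES = {
    "norm": (0.0, 0.0),
    "attention": (0.2, 0.5),
    "mlp": (0.3, 0.7),
    "classifier": (0.5, 0.9),
}

def generate_strategies(layer_types, num_strategies=10, seed=0):
    """Sample distinct layer-wise sparsity assignments within the ranges."""
    rng = np.random.default_rng(seed)
    return [
        {name: rng.uniform(*SPARSITY_RANGES[t]) for name, t in layer_types.items()}
        for _ in range(num_strategies)
    ]

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights
    (weight magnitude standing in for the sensitivity signal)."""
    k = int(weights.size * sparsity)
    if k == 0:
        return weights.copy()
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    return np.where(np.abs(weights) <= threshold, 0.0, weights)

# Toy layer map in the spirit of a Swin-style block plus a classifier head.
layer_types = {"blocks.0.norm1": "norm", "blocks.0.attn": "attention",
               "blocks.0.mlp": "mlp", "head": "classifier"}
strategies = generate_strategies(layer_types)
```

Each of the ten sampled strategies can then be applied with `magnitude_prune` and evaluated once, rather than re-running a full pruning experiment per configuration.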
- Generates 10 distinct pruning strategies per model using sensitivity signals like magnitude and gradient, eliminating repeated experimental runs.
- Reduces accuracy degradation on the Swin-Tiny Vision Transformer by 40% compared to standard single-strategy pruning methods.
- Uses architecture-aware rules, like preserving normalization layers, to create globally coordinated sparsity configurations for CNNs and Transformers.
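The Pareto-optimal efficiency claim implies comparing candidate strategies on two axes, compression and accuracy. The helper below is a generic sketch of that selection step under assumed (sparsity, accuracy) measurements; `pareto_front` and the sample numbers are illustrative, not taken from the paper.

```python
def pareto_front(candidates):
    """Keep strategies not dominated by any other: a candidate is dominated
    if some other candidate has >= sparsity AND >= accuracy,
    with at least one of the two strictly greater."""
    front = []
    for i, (s_i, a_i) in enumerate(candidates):
        dominated = any(
            s_j >= s_i and a_j >= a_i and (s_j > s_i or a_j > a_i)
            for j, (s_j, a_j) in enumerate(candidates) if j != i
        )
        if not dominated:
            front.append((s_i, a_i))
    return front

# Hypothetical (overall sparsity, top-1 accuracy) pairs for candidate strategies.
candidates = [(0.5, 0.80), (0.6, 0.78), (0.6, 0.75), (0.7, 0.70), (0.4, 0.79)]
front = pareto_front(candidates)
```

Only the non-dominated trade-offs survive; a deployment target then picks the front point matching its size budget.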
Why It Matters
Enables more efficient AI deployment on phones and IoT devices by creating significantly smaller models with minimal performance loss.