PSViT cuts Spiking Vision Transformer size by 22.4% with structured pruning
New method prunes SViTs structurally, enabling efficient deployment on standard hardware.
Researchers from multiple institutions have introduced PSViT, a method for structurally pruning Spiking Vision Transformers (SViTs). SViTs are low-power vision models that achieve state-of-the-art performance but are too large for embedded devices. Existing compression techniques rely on unstructured pruning, which creates irregular sparsity patterns that need specialized hardware to deliver any efficiency gain, and this limits scalability. PSViT addresses this with uniform channel-wise filter pruning, which systematically removes the entire filters (channels) that contribute least to accuracy. The method first runs a sensitivity analysis to evaluate how pruning each layer affects accuracy, then performs fine-grained channel-wise pruning tailored to the network architecture.
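To make the idea of channel-wise pruning concrete, here is a minimal sketch in plain Python. It is not the PSViT implementation: the summary does not say how filter importance is scored, so the L1-norm criterion, the function names, and the toy weights below are all assumptions chosen for illustration. The point is only that removing whole filters leaves smaller dense matrices, which ordinary hardware can run without sparse kernels.

```python
def l1_norms(weight):
    """L1 norm of each output filter (one row per filter)."""
    return [sum(abs(w) for w in row) for row in weight]


def prune_channels(weight, next_weight, ratio):
    """Drop the `ratio` fraction of filters with the smallest L1 norm.

    weight:      rows are the output channels of this layer
    next_weight: columns are the input channels of the following layer
    Returns both smaller dense matrices and the kept channel indices.
    NOTE: L1-norm ranking is an assumed importance score, not PSViT's.
    """
    n_out = len(weight)
    n_prune = int(n_out * ratio)
    norms = l1_norms(weight)
    # Channel indices ordered from least to most important.
    order = sorted(range(n_out), key=lambda i: norms[i])
    keep = sorted(order[n_prune:])
    pruned = [weight[i] for i in keep]
    # The next layer loses the matching input columns.
    pruned_next = [[row[i] for i in keep] for row in next_weight]
    return pruned, pruned_next, keep


# Toy layer with 4 filters of 3 weights each (hypothetical values).
layer = [
    [0.9, -0.8, 0.7],
    [0.01, 0.02, -0.01],
    [0.5, 0.4, -0.6],
    [0.05, -0.03, 0.02],
]
next_layer = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
]

pruned, pruned_next, keep = prune_channels(layer, next_layer, ratio=0.5)
print(keep)         # [0, 2]
print(pruned_next)  # [[1.0, 3.0], [5.0, 7.0]]
```

Applying the same `ratio` to every layer corresponds to the "uniform" setting described above. A sensitivity analysis would, under the same assumptions, repeat this for one layer at a time and measure the resulting accuracy change.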
Experimental results on ImageNet-1K show that PSViT achieves 22.4% memory savings with single-shot pruning while retaining most of the original accuracy: 70.3% without fine-tuning and 72.8% with fine-tuning, versus 73.3% for the unpruned model. Because the pruning is structured, standard computing architectures (CPUs, GPUs, TPUs) can accelerate inference without custom hardware. The work is a step toward deploying efficient SViTs in resource-constrained environments such as mobile devices, drones, and IoT sensors. The paper runs 8 pages with 7 figures and 3 tables and is available on arXiv (2606.03257).
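The reported figures can be put side by side with a few lines of arithmetic. The snippet below uses only the numbers quoted above; the variable names are illustrative.

```python
baseline = 73.3      # unpruned accuracy (%)
no_finetune = 70.3   # pruned, no fine-tuning (%)
finetuned = 72.8     # pruned, with fine-tuning (%)
memory_saved = 22.4  # reported memory savings (%)

# Accuracy drops in percentage points, rounded to one decimal.
drop_no_ft = round(baseline - no_finetune, 1)
drop_ft = round(baseline - finetuned, 1)

# Share of the original memory footprint that remains after pruning.
memory_remaining = round(100 - memory_saved, 1)

print(drop_no_ft)        # 3.0
print(drop_ft)           # 0.5
print(memory_remaining)  # 77.6
```

In other words, fine-tuning narrows the gap from 3.0 to 0.5 percentage points while the model keeps about 77.6% of its original memory footprint.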
- PSViT achieves 22.4% memory savings through single-shot structured pruning on SViTs.
- Accuracy stays within 3 percentage points of the original (70.3% without fine-tuning, 72.8% with fine-tuning vs. 73.3% baseline).
- Uniform channel-wise pruning avoids the need for specialized hardware and runs on standard architectures.
Why It Matters
Enables efficient Spiking Vision Transformer deployment on embedded devices without custom hardware accelerators.