[R] Depth-first pruning transfers: GPT-2 → TinyLlama with stable gains and minimal loss
A new method removes specific transformer layers instead of shrinking every layer, producing smaller, faster models.
A new pruning technique called depth-first pruning challenges conventional model compression by selectively removing entire transformer layers rather than uniformly shrinking model dimensions. Developed through experimentation on GPT-2 and validated on TinyLlama 1.1B with full 3-seed replication, the method identifies and removes the least important layers based on sensitivity analysis. Results show that removing 8-12% of layers (from 22 down to 19-20) increases perplexity by only 6-8% while remaining extremely stable across seeds (±0.01 PPL variance).
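To make the selection step concrete, here is a minimal sketch of leave-one-out layer sensitivity analysis, assuming a Hugging Face Llama-family checkpoint. The calibration corpus, the `calibration_loss` and `layer_sensitivities` helpers, and the exact scoring criterion are illustrative assumptions, not the authors' published code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative setup: TinyLlama 1.1B and a tiny stand-in calibration
# corpus (assumption; the actual calibration data is not specified).
name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
model = AutoModelForCausalLM.from_pretrained(name).eval()
tok = AutoTokenizer.from_pretrained(name)
texts = ["The quick brown fox jumps over the lazy dog."]
batches = [tok(t, return_tensors="pt").input_ids for t in texts]

@torch.no_grad()
def calibration_loss(model, batches):
    # Mean next-token loss over the calibration batches.
    losses = [model(input_ids=b, labels=b, use_cache=False).loss.item()
              for b in batches]
    return sum(losses) / len(losses)

@torch.no_grad()
def layer_sensitivities(model, batches):
    # Leave-one-out ablation: score each transformer block by the loss
    # increase when it is skipped, then rank ascending (least important first).
    base = calibration_loss(model, batches)
    layers = model.model.layers
    scores = []
    for i in range(len(layers)):
        model.model.layers = torch.nn.ModuleList(
            [l for j, l in enumerate(layers) if j != i]
        )
        scores.append((i, calibration_loss(model, batches) - base))
    model.model.layers = layers  # restore the full stack
    return sorted(scores, key=lambda s: s[1])
```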
What makes this approach particularly valuable is its transferability across architectures, working effectively on both GPT-2 and Llama-family models, and its practical benefits beyond mere parameter reduction. Unlike traditional width pruning, which shrinks every layer uniformly, depth-first pruning preserves the most useful structural components while eliminating redundant computation. This produces real inference speedups rather than just theoretical parameter savings, making it a clean, reproducible efficiency method that could reshape how we approach model optimization.
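Building on the sketch above, the pruning step itself can be as simple as dropping the lowest-ranked blocks and updating the config. `prune_layers` is a hypothetical helper, and the `self_attn.layer_idx` fix-up assumes the recent transformers Llama implementation.

```python
import copy

def prune_layers(model, batches, k=2):
    # Drop the k lowest-sensitivity blocks; prunes the model in place.
    drop = {i for i, _ in layer_sensitivities(model, batches)[:k]}
    model.model.layers = torch.nn.ModuleList(
        [l for j, l in enumerate(model.model.layers) if j not in drop]
    )
    model.config.num_hidden_layers = len(model.model.layers)
    # Re-index attention blocks so KV-cache bookkeeping stays consistent
    # (recent transformers Llama layers expose `self_attn.layer_idx`).
    for idx, layer in enumerate(model.model.layers):
        layer.self_attn.layer_idx = idx
    return model

# e.g. 22 -> 20 layers: per-token FLOPs drop by the same ~9%, a real
# latency win, since width pruning leaves the sequential depth intact.
pruned = prune_layers(copy.deepcopy(model), batches, k=2)
```

Because entire blocks disappear from the forward pass, the speedup shows up in wall-clock latency, not just in the parameter count.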
The technique represents a shift from "make every layer smaller" to "remove the layers that matter least," emphasizing structural efficiency over uniform scaling. While not claiming state-of-the-art status or introducing a new architecture, the method demonstrates that not all transformer layers contribute equally to model performance. This insight could lead to more efficient model design and deployment strategies, particularly valuable for memory-constrained applications where every percentage point of size reduction matters.
- Removes 8-12% of transformer layers with only a 6-8% perplexity increase
- Transfers across architectures (GPT-2 → TinyLlama) with ±0.01 PPL variance across seeds (measurement sketched below)
- Produces real inference speedups, not just parameter savings
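For reference, a standard way to measure the perplexity numbers cited above is a fixed-window evaluation; the WikiText-2 corpus and window size here are assumptions, since the post does not name the eval set.

```python
import math
from datasets import load_dataset

@torch.no_grad()
def perplexity(model, tok, window=1024):
    # Non-overlapping fixed windows over WikiText-2 test; a common
    # simplification of strided perplexity evaluation.
    text = "\n\n".join(
        load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"]
    )
    ids = tok(text, return_tensors="pt").input_ids[0]
    nll, n = 0.0, 0
    for s in range(0, ids.numel() - window + 1, window):
        chunk = ids[s : s + window].unsqueeze(0)
        loss = model(input_ids=chunk, labels=chunk, use_cache=False).loss
        nll += loss.item() * window
        n += window
    return math.exp(nll / n)

print("baseline:", perplexity(model, tok))
print("pruned:  ", perplexity(pruned, tok))
```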
Why It Matters
Enables deployment of smaller, faster AI models with minimal quality loss, reducing computational costs for real-world applications.