Open Source

We compressed 6 LLMs and found something surprising: they don't degrade the same way

New research shows LLMs degrade at wildly different rates when compressed, with major implications for deployment.

Deep Dive

A new study from Dystrio has upended conventional wisdom about compressing large language models. By systematically shrinking the MLP layers inside transformer architectures, without quantization or custom kernels, the team measured accuracy drops across six models on benchmarks including ARC, HellaSwag, MMLU, and TruthfulQA. The surprising finding: models degrade at wildly different rates. At 14% compression of the MLP layers, for example, Google's Gemma 2B held onto roughly 92% of its original accuracy, while Meta's Llama 3.1 8B fell to about 85%. This indicates there is no universal compression rule; each model has its own 'efficiency frontier.'
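To make the idea concrete, here is a minimal, hypothetical sketch of MLP width compression in PyTorch: it shrinks the intermediate dimension of a Llama-style gated MLP by keeping the neurons whose down-projection columns carry the most weight mass. The class and function names and the norm-based importance heuristic are illustrative assumptions, not the method Dystrio actually used.

# Minimal sketch (not the paper's method) of structured MLP width compression.
import torch
import torch.nn as nn


class GatedMLP(nn.Module):
    """Llama-style MLP: down_proj(silu(gate_proj(x)) * up_proj(x))."""

    def __init__(self, hidden: int, intermediate: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden, intermediate, bias=False)
        self.up_proj = nn.Linear(hidden, intermediate, bias=False)
        self.down_proj = nn.Linear(intermediate, hidden, bias=False)

    def forward(self, x):
        return self.down_proj(
            torch.nn.functional.silu(self.gate_proj(x)) * self.up_proj(x)
        )


def compress_mlp(mlp: GatedMLP, keep_ratio: float) -> GatedMLP:
    intermediate = mlp.gate_proj.out_features
    k = max(1, int(intermediate * keep_ratio))
    # Importance heuristic (an assumption): L2 norm of each neuron's down_proj column.
    scores = mlp.down_proj.weight.norm(dim=0)            # [intermediate]
    keep = torch.topk(scores, k).indices.sort().values   # keep original ordering

    small = GatedMLP(mlp.down_proj.out_features, k)
    small.gate_proj.weight.data = mlp.gate_proj.weight.data[keep, :].clone()
    small.up_proj.weight.data = mlp.up_proj.weight.data[keep, :].clone()
    small.down_proj.weight.data = mlp.down_proj.weight.data[:, keep].clone()
    return small


mlp = GatedMLP(hidden=2048, intermediate=8192)
smaller = compress_mlp(mlp, keep_ratio=0.86)  # roughly 14% fewer MLP parameters
x = torch.randn(1, 4, 2048)
print(smaller(x).shape)  # torch.Size([1, 4, 2048]): same interface as before

Because only the MLP width changes, the compressed block keeps the same input and output shapes and remains a drop-in replacement, which is what allows the result to be saved as an ordinary dense checkpoint.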

The research has immediate practical implications for developers. The compressed models are saved as standard, dense Hugging Face checkpoints compatible with popular inference engines such as vLLM and llama.cpp, and they can be combined with further quantization. The results suggest task-specific deployment strategies: Retrieval-Augmented Generation (RAG) and chat applications tolerate heavier compression, while models used for complex reasoning 'break' much faster, with MMLU scores dropping early. Dystrio is now exploring automated methods to find the optimal compression point for any given model and architecture, a process that currently takes about 25 minutes per model. This work provides a crucial roadmap for running more efficient, cost-effective AI by tailoring compression to the specific model and its intended use case.
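Since the outputs are standard dense Hugging Face checkpoints, the usual serving and quantization tooling applies unchanged. The sketch below assumes a placeholder repo id ("your-org/gemma-2b-mlp-compressed", not a real artifact from the study): it loads the model with vLLM for serving and, separately, with transformers plus bitsandbytes to stack 4-bit quantization on top of the compression.

# Serving a compressed checkpoint with vLLM (placeholder repo id).
from vllm import LLM, SamplingParams

llm = LLM(model="your-org/gemma-2b-mlp-compressed")
outputs = llm.generate(
    ["Summarize retrieval-augmented generation in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)

# Stacking 4-bit quantization on top of the compressed weights
# (requires transformers and bitsandbytes to be installed).
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "your-org/gemma-2b-mlp-compressed",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)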

Key Points
  • Gemma 2B retained ~92% accuracy at 14% MLP layer compression, showing high resilience.
  • Llama 3.1 8B performance dropped to ~85% at the same compression level, degrading much faster.
  • The study created standard HF checkpoints, enabling further quantization and use with vLLM/llama.cpp for efficient deployment.

Why It Matters

Enables developers to strategically compress models for specific tasks (RAG vs. reasoning), cutting compute costs by matching compression levels to each model's tolerance rather than assuming performance degrades uniformly.