A new method makes large AI models smaller and faster to run
A new technique shrinks AI models by 40% and cuts response latency by nearly as much.
Researchers introduced MoP, a method for compressing large language models. Unlike previous techniques, which pruned either a model's depth (its number of layers) or its width (the size of each layer), MoP combines both in an iterative process. It outperformed other compression techniques on models such as LLaMA-2 and LLaMA-3, cutting inference latency by 39% at 40% compression. It also proved effective on a vision-language model, maintaining performance with simple text-based fine-tuning.
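To make the idea of combining depth and width pruning concrete, here is a minimal toy sketch in Python. It is not the researchers' implementation: the `ToyModel` class, the parameter count, the strict alternation between the two pruning steps, and the 10% width reduction per step are all assumptions chosen for illustration, and the summary above does not say how MoP decides which step to take.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyModel:
    """Hypothetical stand-in for a transformer: only layer sizes are tracked."""
    hidden: int    # model dimension shared by all layers
    widths: tuple  # inner width of each layer's feed-forward block

    def params(self) -> int:
        # Toy count: each layer has an up- and a down-projection.
        return sum(2 * self.hidden * w for w in self.widths)


def prune_depth(model: ToyModel) -> ToyModel:
    # Stand-in for removing the least important layer: drop the last one.
    return ToyModel(model.hidden, model.widths[:-1])


def prune_width(model: ToyModel, keep_pct: int = 90) -> ToyModel:
    # Stand-in for removing low-importance channels: shrink every layer.
    new_widths = tuple(max(1, w * keep_pct // 100) for w in model.widths)
    return ToyModel(model.hidden, new_widths)


def compress(model: ToyModel, target: float):
    """Alternate depth and width steps until `target` of the parameters is gone."""
    budget = model.params() * (1.0 - target)
    steps = []
    while model.params() > budget:
        # Assumed schedule: even steps prune depth, odd steps prune width.
        use_depth = len(steps) % 2 == 0 and len(model.widths) > 1
        pruned = prune_depth(model) if use_depth else prune_width(model)
        if pruned.params() == model.params():
            break  # nothing left to remove
        model = pruned
        steps.append("depth" if use_depth else "width")
    return model, steps


base = ToyModel(hidden=64, widths=(256,) * 8)
small, steps = compress(base, target=0.40)
removed = 1 - small.params() / base.params()
print(steps, small.widths, f"{removed:.0%} of parameters removed")
```

In a real system each step would be guided by importance scores and followed by evaluation or fine-tuning. The sketch only shows how shrinking along both dimensions reaches a parameter budget without relying on either one alone.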
Why It Matters
Compression of this kind makes powerful AI models more efficient to run and easier to deploy on everyday devices.