AngelSlim: A more accessible, comprehensive, and efficient toolkit for large model compression
The toolkit introduces HY-1.8B-int2, the first industrially viable 2-bit large language model.
Tencent's Hunyuan AI team has introduced AngelSlim, a comprehensive open-source toolkit designed to make large model compression more accessible and efficient for both research and industrial deployment. The toolkit consolidates state-of-the-art compression techniques, including quantization, speculative decoding, token pruning, and distillation, into a unified pipeline. A standout achievement is HY-1.8B-int2, presented as the first industrially viable 2-bit large language model, pushing the boundary of ultra-low-bit quantization. The release addresses a critical industry need: reducing the massive computational cost and latency of deploying models like GPT-4, Claude 3, or Llama 3 at scale.
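For intuition about what 2-bit quantization entails: each weight is snapped to one of four discrete levels, and the resulting codes are packed densely so sixteen weights fit in a single 32-bit word. The actual HY-1.8B-int2 recipe is not documented in this summary; the sketch below is a generic illustration using per-channel symmetric rounding, with all function names and choices being assumptions.

```python
# Hedged sketch of generic 2-bit weight quantization, NOT the HY-1.8B-int2
# recipe: per-output-channel scaling, rounding to the four levels {-2,-1,0,1},
# and packing sixteen 2-bit codes into each uint32 word.
import numpy as np

def quantize_int2(w: np.ndarray):
    """Quantize a (out_features, in_features) weight matrix to 2-bit codes."""
    # Per-channel scale so the largest magnitude lands near the grid edge.
    scale = np.abs(w).max(axis=1, keepdims=True) / 2.0 + 1e-8
    q = np.clip(np.round(w / scale), -2, 1).astype(np.int8)  # 4 levels
    return q, scale

def pack_int2(q: np.ndarray) -> np.ndarray:
    """Pack 2-bit codes (values in [-2, 1]) into uint32 words, 16 codes each."""
    u = (q + 2).astype(np.uint32)            # shift to unsigned range [0, 3]
    u = u.reshape(-1, 16)                    # assumes size divisible by 16
    shifts = np.arange(16, dtype=np.uint32) * 2
    return (u << shifts).sum(axis=1).astype(np.uint32)

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 64).astype(np.float32)
q, s = quantize_int2(w)
packed = pack_int2(q)
print("compression vs. fp32:", w.nbytes / packed.nbytes)  # ~16x, excluding scales
```

At this bit width the quantization grid is so coarse that naive rounding, as above, destroys accuracy; making a 2-bit model "industrially viable" is precisely the hard part the toolkit claims to address.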
Technically, AngelSlim integrates advanced FP8 and INT8 Post-Training Quantization (PTQ) and proposes a novel training-aligned speculative decoding framework compatible with multimodal models, delivering 1.8x to 2.0x throughput gains without output degradation. For long-context scenarios, it includes a training-free sparse attention framework that decouples sparse kernels from model architectures to reduce Time-to-First-Token (TTFT). The toolkit also offers specialized multimodal pruning: IDPruner, which selects vision tokens via Maximal Marginal Relevance, and Samp, which adaptively merges audio tokens. By exposing these low-level implementations, AngelSlim enables algorithm-focused research and significantly lowers the barrier to deploying compressed, high-performance models in production.
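Maximal Marginal Relevance (MMR) is a classic selection rule that balances how relevant each item is against how redundant it is with items already kept. IDPruner's exact scoring is not specified in this summary, so the sketch below makes illustrative assumptions: relevance is cosine similarity to the mean token embedding, and the function name and λ weighting are hypothetical.

```python
# Hedged sketch of MMR-style vision token pruning (assumed scoring, not
# IDPruner's documented algorithm): greedily keep tokens that are similar to
# the global context but dissimilar to tokens already selected.
import numpy as np

def mmr_prune(tokens: np.ndarray, keep: int, lam: float = 0.7) -> np.ndarray:
    """Return indices of `keep` tokens balancing relevance and diversity."""
    x = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    query = x.mean(axis=0)
    query /= np.linalg.norm(query) + 1e-8
    relevance = x @ query                  # similarity to global context
    sim = x @ x.T                          # pairwise token similarity
    selected = [int(np.argmax(relevance))]
    candidates = set(range(len(x))) - set(selected)
    while len(selected) < keep:
        best, best_score = None, -np.inf
        for i in candidates:
            redundancy = max(sim[i, j] for j in selected)
            score = lam * relevance[i] - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        candidates.remove(best)
    return np.array(sorted(selected))

vision_tokens = np.random.randn(576, 1024).astype(np.float32)  # e.g. ViT patches
kept = mmr_prune(vision_tokens, keep=144)                      # 4x token reduction
```

The diversity term is what distinguishes MMR from simple top-k attention pruning: near-duplicate patches (sky, uniform backgrounds) are dropped even when each scores well individually.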
- Introduces HY-1.8B-int2 as the first viable 2-bit LLM for industry, enabling extreme model size reduction.
- Speculative decoding framework boosts inference throughput by 1.8x to 2.0x for multimodal models without compromising output quality (see the sketch after this list).
- Provides a unified pipeline with quantization, pruning, and distillation, streamlining compression from research to deployment.
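For intuition on the speculative decoding claim above, here is a minimal sketch of the greedy variant that underlies such frameworks: a small draft model proposes k tokens, the large target model verifies them in one pass, and the longest matching prefix is kept. This is not AngelSlim's training-aligned algorithm (which additionally aligns the draft model with the target and supports multimodal inputs); the `Model` type and toy callables are assumptions for illustration.

```python
# Minimal sketch of greedy speculative decoding. A "model" here is a stand-in:
# given a token sequence, it returns the greedy next-token prediction for
# every position; real implementations compute this as one batched forward
# pass over logits.
from typing import Callable, List

Model = Callable[[List[int]], List[int]]

def speculative_step(draft: Model, target: Model, tokens: List[int], k: int) -> List[int]:
    # 1) The cheap draft model proposes k tokens autoregressively.
    proposal = list(tokens)
    for _ in range(k):
        proposal.append(draft(proposal)[-1])
    # 2) A single target pass yields the target's greedy choice per position.
    target_preds = target(proposal[:-1])
    # 3) Accept draft tokens while they match the target; on the first
    #    mismatch, substitute the target's token and stop.
    accepted = list(tokens)
    for i in range(len(tokens), len(proposal)):
        accepted.append(target_preds[i - 1])
        if proposal[i] != target_preds[i - 1]:
            break
    return accepted

# Toy demo: both models predict (last token + 1) % 100, so all k draft tokens
# are accepted and one target pass advances the sequence by k tokens.
toy = lambda toks: [(t + 1) % 100 for t in toks]
print(speculative_step(toy, toy, [1, 2, 3], k=4))  # [1, 2, 3, 4, 5, 6, 7]
```

Because the target model validates every accepted token, output quality is preserved by construction; the speedup comes from amortizing one expensive forward pass over several cheap draft tokens.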
Why It Matters
Dramatically lowers the cost and latency of running massive AI models, making advanced AI more accessible for real-world applications.