Super Apriel: One Checkpoint, Many Speeds
A single 15B-parameter model checkpoint can dynamically switch between four attention mechanisms for on-demand speed.
SLAM Labs has introduced Super Apriel, a novel 15-billion-parameter 'supernet' architecture that fundamentally changes how AI models balance speed and quality. Unlike a standard model with a fixed architecture, Super Apriel embeds four distinct 'mixer' or attention mechanisms—Full Attention (FA), Sliding Window Attention (SWA), Kimi Delta Attention (KDA), and Gated DeltaNet (GDN)—within each of its 48 decoder layers. At inference time, a serving system can dynamically select a specific 'placement,' choosing one mixer per layer to create a custom model configuration. This allows a single checkpoint to serve multiple speed presets on the fly, from a high-quality, slower FA mode to much faster hybrid configurations.
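A "placement" can be thought of as a length-48 vector assigning one of the four mixers to each decoder layer. The sketch below illustrates that idea only; the names, validation helper, and example preset are hypothetical and not SLAM Labs' released API.

```python
# Hypothetical sketch of a per-layer "placement": exactly one mixer chosen
# for each of the 48 decoder layers. Illustrative names, not the real API.
from typing import List

MIXERS = ("FA", "SWA", "KDA", "GDN")  # the four mixers embedded in each layer
NUM_LAYERS = 48

def make_placement(choices: List[str]) -> List[str]:
    """Validate a placement: one known mixer per decoder layer."""
    if len(choices) != NUM_LAYERS:
        raise ValueError(f"expected {NUM_LAYERS} layer choices, got {len(choices)}")
    for i, mixer in enumerate(choices):
        if mixer not in MIXERS:
            raise ValueError(f"layer {i}: unknown mixer {mixer!r}")
    return list(choices)

# An illustrative hybrid preset: keep full attention in early layers,
# switch to cheaper mixers deeper in the stack.
hybrid = make_placement(["FA"] * 8 + ["SWA"] * 16 + ["GDN"] * 24)
```

An all-`"FA"` vector would correspond to the high-quality, slower mode, while vectors dominated by the linear mixers yield the faster presets.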
This flexibility yields dramatic performance gains. The recommended hybrid placements achieve decode throughput 2.9x to 10.7x faster than the full model, with quality retention ranging from 96% down to 77%. These speed advantages compound at longer context lengths. The model also enables speculative decoding—a technique to accelerate inference—without needing a separate draft model, as the faster placements within the same checkpoint can draft tokens for the slower ones. To navigate the vast configuration space (4^48 possibilities), the team developed a surrogate model that predicts placement quality, making it tractable to find the optimal speed-quality trade-off for any given task.
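Because exhaustively evaluating 4^48 placements is impossible, a cheap surrogate score lets a search procedure rank candidates without running the full model. The sketch below shows the general shape of such a surrogate-guided search; the cost table, quality heuristic, and search loop are stand-ins of our own, not SLAM Labs' placement optimization toolkit.

```python
# Illustrative surrogate-guided placement search: sample random placements,
# estimate decode speedup, and keep the candidate with the best predicted
# quality subject to a speed constraint. All numbers are assumptions.
import random

MIXERS = ("FA", "SWA", "KDA", "GDN")
NUM_LAYERS = 48
# Assumed relative per-layer decode cost (FA slowest); illustrative only.
COST = {"FA": 1.0, "SWA": 0.4, "KDA": 0.25, "GDN": 0.25}

def estimated_speedup(placement):
    """Speedup vs. all-FA, assuming decode cost is additive over layers."""
    return NUM_LAYERS / sum(COST[m] for m in placement)

def surrogate_quality(placement):
    """Stand-in for a learned surrogate: rewards FA, weighted toward early layers."""
    return sum((1.0 if m == "FA" else 0.8) * (1.0 - i / (2 * NUM_LAYERS))
               for i, m in enumerate(placement)) / NUM_LAYERS

def search(min_speedup=2.0, samples=2000, seed=0):
    """Random search: best predicted quality among fast-enough placements."""
    rng = random.Random(seed)
    best, best_q = None, -1.0
    for _ in range(samples):
        p = [rng.choice(MIXERS) for _ in range(NUM_LAYERS)]
        if estimated_speedup(p) >= min_speedup:
            q = surrogate_quality(p)
            if q > best_q:
                best, best_q = p, q
    return best, best_q
```

In practice the surrogate would be a model trained to predict benchmark quality from a placement, and the search could be greedy or evolutionary rather than random; the point is that a cheap predictor makes the otherwise intractable space navigable.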
The supernet, trained via stochastic distillation from a frozen Apriel 1.6 teacher, also reveals an important scaling insight: while optimal configurations stabilize quickly in smaller 0.5B models, they remain unstable in the full 15B model, cautioning against extrapolating efficiency findings from small-scale experiments. SLAM Labs is releasing the full package: the supernet weights, Fast-LLM training code, vLLM serving integration, and a placement optimization toolkit, providing the community with a powerful new paradigm for efficient model serving.
- Dynamic Speed Presets: A single 15B checkpoint serves multiple models, enabling 2.9x to 10.7x faster decoding by switching layer configurations at runtime.
- Integrated Speculative Decoding: Eliminates the need for a separate draft model, as faster internal placements can draft tokens for slower ones.
- Vast Configuration Space Managed: A surrogate model predicts quality for any of the 4^48 possible layer-mixer assignments, identifying optimal speed-quality trade-offs.
Why It Matters
Enables AI providers to dynamically optimize inference cost and latency for different queries from a single deployed model, drastically improving serving efficiency.