Unsloth claims 12x faster MoE training with 35% less VRAM
The release could bring large MoE models within reach of consumer GPUs.
Unsloth AI has released custom Triton kernels that it says deliver 12x faster training and more than 35% lower VRAM usage for Mixture of Experts (MoE) models, with no loss in accuracy. The optimizations support models such as Qwen3-30B and GPT-OSS-20B, with fine-tuning reported in as little as 12.8GB of VRAM. The gains reportedly grow with model size, and the kernels run on both data-center GPUs and consumer cards such as the RTX 3090.
Why It Matters
If the claims hold up, this lowers the cost and hardware barrier for developers and researchers who want to experiment with and deploy state-of-the-art MoE architectures.