CRAFT: Cost-aware Expert Replica Allocation with Fine-Grained Layerwise Estimations
New system optimizes trillion-parameter AI model serving by eliminating wasteful expert replication.
A research team led by Adrian Zhao has developed CRAFT (Cost-aware Expert Replica Allocation with Fine-Grained Layerwise Estimations), a framework that improves the efficiency of serving massive Mixture-of-Experts (MoE) language models. MoE architectures, used in models ranging from hundreds of billions to a trillion parameters, distribute specialized "expert" sub-networks across multiple GPUs. Because the router assigns tokens to experts unevenly, some experts (and the GPUs hosting them) receive far more tokens than others, creating token-level load imbalance during inference. Current systems replicate popular experts to balance the load, but they often spend precious GPU memory on replicas that yield little performance benefit.
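To make the imbalance concrete, here is a minimal Python sketch that simulates one MoE layer with Zipf-like routing skew and measures how unevenly tokens land on GPUs. The expert count, token count, placement scheme, and routing distribution are all illustrative assumptions, not figures from the paper.

```python
import numpy as np

# Hypothetical illustration of token-level load imbalance in one MoE layer.
rng = np.random.default_rng(0)
num_experts, num_gpus, tokens = 64, 8, 16_384

# Zipf-like popularity: a few "hot" experts attract most of the tokens.
popularity = 1.0 / np.arange(1, num_experts + 1)
popularity /= popularity.sum()
assignments = rng.choice(num_experts, size=tokens, p=popularity)

tokens_per_expert = np.bincount(assignments, minlength=num_experts)
# Assume experts are sharded in contiguous blocks across GPUs
# (experts 0-7 on GPU 0, 8-15 on GPU 1, and so on).
tokens_per_gpu = tokens_per_expert.reshape(num_gpus, -1).sum(axis=1)
print("max/mean GPU load:", tokens_per_gpu.max() / tokens_per_gpu.mean())
```

Under this placement the hot experts concentrate on one GPU, and the max/mean ratio indicates how long the busiest GPU holds back the whole layer's forward pass; that straggler effect is what expert replication tries to remove.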
CRAFT addresses this with fine-grained, per-layer replication decisions: it estimates the benefit of each candidate replica and allocates copies to maximize load balance within a strict memory budget. The system integrates into existing serving frameworks without model modifications or retraining. In evaluations, CRAFT delivered a 1.14× average throughput improvement (up to 1.2×) over conventional replication methods, meaning providers can serve more users on the same hardware, or cut costs while maintaining performance, for trillion-parameter models.
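The paper's exact procedure is not reproduced here, but the core idea of spending a fixed replica budget where the estimated per-layer benefit is largest can be sketched as a greedy loop. Everything below (the `allocate_replicas` name and the load/r benefit model) is a hypothetical illustration under simple assumptions, not CRAFT's published algorithm.

```python
import heapq

def allocate_replicas(load, replicas, budget):
    """Greedy sketch of per-layer, benefit-driven replica allocation.

    load[l][e]     -- expected token load on expert e of layer l
    replicas[l][e] -- current replica count for that expert (starts at 1)
    budget         -- number of extra expert copies that fit in GPU memory
    """
    def gain(l, e):
        # Adding a copy splits the load over one more replica, so the
        # per-replica load falls from load/r to load/(r+1).
        r = replicas[l][e]
        return load[l][e] / r - load[l][e] / (r + 1)

    # Max-heap (negated gains) over every (layer, expert) candidate.
    heap = [(-gain(l, e), l, e)
            for l in range(len(load)) for e in range(len(load[l]))]
    heapq.heapify(heap)

    for _ in range(budget):
        if not heap:
            break
        _, l, e = heapq.heappop(heap)          # best candidate right now
        replicas[l][e] += 1                    # place one more copy
        heapq.heappush(heap, (-gain(l, e), l, e))  # re-queue with new gain
    return replicas

# Example: 2 layers x 4 experts, room for 3 extra copies.
load = [[900, 50, 30, 20], [400, 350, 50, 200]]
replicas = [[1, 1, 1, 1], [1, 1, 1, 1]]
print(allocate_replicas(load, replicas, budget=3))
# -> [[2, 1, 1, 1], [2, 2, 1, 1]]
```

Because each additional copy of the same expert yields a shrinking gain (load/r minus load/(r+1)), the loop naturally spreads the budget across layers and experts rather than piling replicas onto a single hot expert, which is exactly the kind of low-benefit replication the paper identifies as wasted memory.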
- Increases MoE model serving throughput by 14% on average (up to 20%) through optimized expert replication
- Uses fine-grained, per-layer benefit estimation to allocate GPU memory more efficiently than current methods
- Works with existing serving frameworks for models from hundreds of billions to a trillion parameters without retraining
Why It Matters
Enables more cost-effective deployment of trillion-parameter AI models, reducing infrastructure costs for large-scale providers such as OpenAI and Anthropic.