CRAFT: Cost-aware Expert Replica Allocation with Fine-Grained Layerwise Estimations
New system optimizes trillion-parameter AI model serving by eliminating wasteful expert replication.
A research team led by Adrian Zhao has developed CRAFT (Cost-aware Expert Replica Allocation with Fine-Grained Layerwise Estimations), a framework that improves the efficiency of serving massive Mixture-of-Experts (MoE) language models. MoE architectures, used in models ranging from hundreds of billions to a trillion parameters, distribute specialized "expert" sub-networks across multiple GPUs. Because the router assigns tokens to experts unevenly, some experts (and the GPUs hosting them) receive far more tokens than others, creating token-level load imbalance during inference. Current systems replicate popular experts to balance the load, but they often spend precious GPU memory on replicas that yield little performance benefit.
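To make the imbalance concrete, here is a minimal Python sketch that simulates one MoE layer with Zipf-like routing skew and measures how unevenly tokens land on GPUs. The expert count, token count, placement scheme, and routing distribution are all illustrative assumptions, not figures from the paper.

```python
import numpy as np

# Hypothetical illustration of token-level load imbalance in one MoE layer.
rng = np.random.default_rng(0)
num_experts, num_gpus, tokens = 64, 8, 16_384

# Zipf-like popularity: a few "hot" experts attract most of the tokens.
popularity = 1.0 / np.arange(1, num_experts + 1)
popularity /= popularity.sum()
assignments = rng.choice(num_experts, size=tokens, p=popularity)

tokens_per_expert = np.bincount(assignments, minlength=num_experts)
# Assume experts are sharded in contiguous blocks across GPUs
# (experts 0-7 on GPU 0, 8-15 on GPU 1, and so on).
tokens_per_gpu = tokens_per_expert.reshape(num_gpus, -1).sum(axis=1)
print("max/mean GPU load:", tokens_per_gpu.max() / tokens_per_gpu.mean())
```

Under this placement the hot experts concentrate on one GPU, and the max/mean ratio indicates how long the busiest GPU holds back the whole layer's forward pass; that straggler effect is what expert replication tries to remove.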
CRAFT addresses this with fine-grained, per-layer replication decisions: it estimates the benefit of each candidate replica and allocates copies to maximize load balance within a strict memory budget. The system integrates into existing serving frameworks without model modifications or retraining. In evaluations, CRAFT delivered a 1.14× average throughput improvement (up to 1.2×) over conventional replication methods, meaning providers can serve more users on the same hardware, or cut costs while maintaining performance, for trillion-parameter models.
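The paper's exact procedure is not reproduced here, but the core idea of spending a fixed replica budget where the estimated per-layer benefit is largest can be sketched as a greedy loop. Everything below (the `allocate_replicas` name and the load/r benefit model) is a hypothetical illustration under simple assumptions, not CRAFT's published algorithm.

```python
import heapq

def allocate_replicas(load, replicas, budget):
    """Greedy sketch of per-layer, benefit-driven replica allocation.

    load[l][e]     -- expected token load on expert e of layer l
    replicas[l][e] -- current replica count for that expert (starts at 1)
    budget         -- number of extra expert copies that fit in GPU memory
    """
    def gain(l, e):
        # Adding a copy splits the load over one more replica, so the
        # per-replica load falls from load/r to load/(r+1).
        r = replicas[l][e]
        return load[l][e] / r - load[l][e] / (r + 1)

    # Max-heap (negated gains) over every (layer, expert) candidate.
    heap = [(-gain(l, e), l, e)
            for l in range(len(load)) for e in range(len(load[l]))]
    heapq.heapify(heap)

    for _ in range(budget):
        if not heap:
            break
        _, l, e = heapq.heappop(heap)          # best candidate right now
        replicas[l][e] += 1                    # place one more copy
        heapq.heappush(heap, (-gain(l, e), l, e))  # re-queue with new gain
    return replicas

# Example: 2 layers x 4 experts, room for 3 extra copies.
load = [[900, 50, 30, 20], [400, 350, 50, 200]]
replicas = [[1, 1, 1, 1], [1, 1, 1, 1]]
print(allocate_replicas(load, replicas, budget=3))
# -> [[2, 1, 1, 1], [2, 2, 1, 1]]
```

Because each additional copy of the same expert yields a shrinking gain (load/r minus load/(r+1)), the loop naturally spreads the budget across layers and experts rather than piling replicas onto a single hot expert, which is exactly the kind of low-benefit replication the paper identifies as wasted memory.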
- Increases MoE model serving throughput by 14% on average (up to 20%) through optimized expert replication
- Uses fine-grained, per-layer benefit estimation to allocate GPU memory more efficiently than current methods
- Works with existing serving frameworks for models from hundreds of billions to a trillion parameters without retraining
Why It Matters
Enables more cost-effective deployment of trillion-parameter AI models, reducing infrastructure costs for large-scale providers such as OpenAI and Anthropic.