Accelerating Edge Inference for Distributed MoE Models with Latency-Optimized Expert Placement
New system optimizes Mixture-of-Experts models for edge servers, slashing communication costs and latency.
A research team led by Tian Wu has introduced Prism, a novel inference framework designed to tackle the challenge of running massive Mixture-of-Experts (MoE) models on distributed edge hardware. MoE models, like many cutting-edge LLMs, achieve vast capacity by activating only a subset of specialized 'expert' sub-networks per input, but their deployment is hampered by large memory demands and communication overhead that overwhelm typical edge servers. Prism addresses this by optimizing where these experts are physically placed across a network of heterogeneous GPU-equipped edge devices, minimizing the data that must be exchanged between servers and, with it, end-to-end inference latency.
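To make the sparse-activation idea concrete, the sketch below shows standard top-k MoE gating in plain Python: a router scores every expert for each token, but only the k best are activated and mixed. The function name, the use of NumPy, and the token/expert counts are illustrative assumptions, not details from the Prism paper.

```python
import numpy as np

def top_k_route(gate_logits, k=2):
    """Pick the k highest-scoring experts per token (generic MoE-style gating).

    gate_logits: array of shape (num_tokens, num_experts), assumed to come
    from the model's router. Returns (indices, weights): for each token, the
    chosen expert ids and their softmax-normalized mixing weights.
    """
    top_idx = np.argsort(-gate_logits, axis=-1)[:, :k]              # k best experts per token
    top_scores = np.take_along_axis(gate_logits, top_idx, axis=-1)
    weights = np.exp(top_scores - top_scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)                  # renormalize over the chosen k
    return top_idx, weights

# Example: 4 tokens routed over 8 experts, only 2 experts active per token.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 8))
idx, w = top_k_route(logits, k=2)
print(idx)  # which experts each token activates
print(w)    # how their outputs are mixed
```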
Prism's core innovation is its activation-aware placement strategy, which analyzes workload patterns to position experts so that most requests can be handled locally without exceeding each device's memory budget. It is complemented by a runtime migration mechanism that redistributes experts as workload patterns shift. Experiments on contemporary MoE models and datasets show that this cooperative, edge-centric approach delivers substantial performance gains: up to a 30.6% reduction in inference latency and a marked decrease in communication cost compared with state-of-the-art baselines, demonstrating that efficient, decentralized MoE serving is feasible outside the cloud.
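The paper's exact placement algorithm is not reproduced here; as a rough illustration of what an activation-aware heuristic can look like, the sketch below greedily pins each expert to the server whose local requests activate it most, subject to a per-server memory budget. All names, the capacity model, and the numbers are assumptions made for the example, not Prism's actual method.

```python
def place_experts(activation_counts, capacities, expert_size=1):
    """Greedy activation-aware placement sketch.

    activation_counts[s][e]: how often requests arriving at server s activate
    expert e (an assumed, profiled workload statistic).
    capacities[s]: how many experts server s can hold in GPU memory.
    Returns a dict expert id -> server id that keeps each expert close to the
    servers activating it most, within the memory limits.
    """
    num_servers = len(activation_counts)
    num_experts = len(activation_counts[0])
    remaining = list(capacities)
    placement = {}

    # Consider the hottest (expert, server) pairs first: co-locating an expert
    # with the server that activates it most avoids the most cross-server traffic.
    pairs = sorted(
        ((activation_counts[s][e], s, e)
         for s in range(num_servers) for e in range(num_experts)),
        reverse=True,
    )
    for count, s, e in pairs:
        if e in placement:
            continue                      # expert already placed
        if remaining[s] >= expert_size:
            placement[e] = s
            remaining[s] -= expert_size

    # Fallback: any expert not yet placed goes to whichever server still has room.
    for e in range(num_experts):
        if e not in placement:
            for s in range(num_servers):
                if remaining[s] >= expert_size:
                    placement[e] = s
                    remaining[s] -= expert_size
                    break
    return placement

# Example: 2 servers, 4 experts, each server can hold 2 experts.
counts = [
    [90, 10, 40, 5],   # server 0's local activation profile
    [15, 80, 20, 60],  # server 1's local activation profile
]
print(place_experts(counts, capacities=[2, 2]))
# -> {0: 0, 1: 1, 3: 1, 2: 0}: hot experts land where they are activated most
```

A runtime migration step, as described in the article, would periodically recompute such a placement from fresh activation statistics and move only the experts whose assignments changed.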
- Prism framework reduces MoE model inference latency by up to 30.6% on edge hardware.
- Uses activation-aware expert placement and runtime migration to minimize inter-server communication.
- Enables practical deployment of large sparse models on distributed, resource-constrained edge devices.
Why It Matters
Enables faster, cheaper, and more private AI inference on local devices, reducing reliance on costly cloud infrastructure.