ArcLight: A Lightweight LLM Inference Architecture for Many-Core CPUs
New framework tackles cross-NUMA memory bottlenecks, unlocking high-performance AI on existing server hardware.
A research team from Harbin Institute of Technology has introduced ArcLight, a new lightweight architecture designed to run large language models (LLMs) far more efficiently on standard many-core CPUs. Current frameworks like vLLM or Hugging Face's Transformers fail to fully exploit modern server CPUs, which are often organized into multiple NUMA (Non-Uniform Memory Access) nodes. The significant overhead of accessing memory across these nodes creates a major performance bottleneck, limiting how fast models like Llama 3 or Mistral can generate text on ubiquitous server hardware.
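To see why this matters, it helps to look at what a NUMA topology actually exposes to software. The short snippet below is not from the paper; it is only an illustrative way to read the node-to-CPU mapping that any NUMA-aware scheduler has to respect on a Linux server (the `numa_topology` helper name is made up for this example):

```python
# Illustrative only: list each NUMA node and the CPUs attached to it,
# using the sysfs layout Linux exposes on multi-socket servers.
from pathlib import Path

def numa_topology():
    """Return {node_id: cpulist_string} for each NUMA node the kernel reports."""
    topology = {}
    for node_dir in sorted(Path("/sys/devices/system/node").glob("node[0-9]*")):
        node_id = int(node_dir.name[len("node"):])
        topology[node_id] = (node_dir / "cpulist").read_text().strip()
    return topology

if __name__ == "__main__":
    for node, cpus in numa_topology().items():
        print(f"NUMA node {node}: CPUs {cpus}")
```

On a typical two-socket machine this prints two nodes, each with its own block of cores; memory attached to one node is markedly slower to reach from the other node's cores, which is the overhead the paper targets.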
ArcLight tackles this problem head-on with a ground-up redesign. Its core innovation is a combination of finely controlled tensor parallelism—splitting model computations strategically across CPU cores—and intelligent thread scheduling that minimizes costly cross-NUMA data movement. This approach effectively mitigates the 'memory access wall' that has capped performance. The results are substantial: in experiments, ArcLight achieved up to 46% higher inference throughput compared to mainstream frameworks, pushing the performance ceiling for CPU-based AI.
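The paper's exact mechanisms are not reproduced here, but the general technique it builds on can be sketched: split a layer's weight matrix column-wise across NUMA nodes, pin each worker thread to its node's cores, and keep each shard in that node's local memory so only the small partial outputs ever cross node boundaries. The Python sketch below is a minimal illustration under assumed conditions (a two-node Linux box with the CPU ranges in `NODE_CPUS`); `load_shards`, `parallel_linear`, and the other names are invented for this example and are not ArcLight's API.

```python
# Minimal sketch (not ArcLight's implementation): column-wise tensor
# parallelism plus NUMA-aware thread pinning. NODE_CPUS is an assumed
# two-node topology; replace it with the real cpulists of your server.
import os
import threading
import numpy as np

NODE_CPUS = {0: set(range(0, 8)), 1: set(range(8, 16))}  # assumption

def _pin_to_node(node):
    # Linux only: pid 0 pins the calling thread; skip CPUs we don't own.
    cpus = NODE_CPUS[node] & os.sched_getaffinity(0)
    if cpus:
        os.sched_setaffinity(0, cpus)

def _run_on_all_nodes(fn):
    threads = [threading.Thread(target=fn, args=(n,)) for n in NODE_CPUS]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

def load_shards(weight):
    """Split the weight along the output dim, one shard per NUMA node.

    Each shard is copied by a thread pinned to its node, so Linux's
    first-touch policy places the shard's pages in that node's local memory.
    """
    pieces = np.array_split(weight, len(NODE_CPUS), axis=1)
    shards = [None] * len(NODE_CPUS)

    def place(node):
        _pin_to_node(node)
        shards[node] = np.ascontiguousarray(pieces[node])  # first touch here

    _run_on_all_nodes(place)
    return shards

def parallel_linear(x, shards):
    """Each node multiplies against its local shard; concatenate the pieces."""
    outs = [None] * len(NODE_CPUS)

    def work(node):
        _pin_to_node(node)
        outs[node] = x @ shards[node]  # weight reads stay node-local

    _run_on_all_nodes(work)
    return np.concatenate(outs, axis=-1)

if __name__ == "__main__":
    x = np.random.randn(1, 4096).astype(np.float32)
    w = np.random.randn(4096, 4096).astype(np.float32)
    y = parallel_linear(x, load_shards(w))
    print(y.shape)  # (1, 4096)
```

The design point the sketch tries to convey is that each node only ever reads the weight shard resident in its own memory, so the sole cross-node traffic per layer is the comparatively tiny concatenation of partial outputs rather than streaming the full weight matrix across the interconnect.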
The architecture's lightweight design and compatibility with general-purpose CPU hardware make it particularly significant for real-world deployment. It lets organizations run LLM inference directly on their existing web server fleets and high-end networking equipment, potentially delaying or reducing the need for expensive GPU investments. By making better use of installed hardware, ArcLight could accelerate the integration of AI capabilities into a wide range of latency-tolerant enterprise applications and services.
- Achieves up to 46% higher inference throughput by optimizing for many-core CPU NUMA architecture
- Uses finely controlled tensor parallelism and smart scheduling to minimize cross-node memory access
- Maintains compatibility with standard CPU hardware, enabling efficient AI on existing servers
Why It Matters
Lowers the cost and hardware barrier to deploying LLMs at scale, making AI more accessible for enterprise applications on existing infrastructure.