PAM: Processing Across Memory Hierarchy for Efficient KV-centric LLM Serving System
This proposed memory-centric hardware architecture could sharply reduce the cost of LLM inference.
Researchers have proposed PAM (Processing Across Memory), a hardware system designed to relieve a critical bottleneck in LLM serving: storing the Key-Value (KV) cache and computing attention over it. By distributing KV tokens across a hierarchy of memory devices and introducing a novel PAMattention algorithm, the system aims to satisfy both the bandwidth and the capacity demands of large-scale serving at once. This targets a core inefficiency in current serving systems, which are optimized for compute rather than for memory-intensive operations such as attention over the KV cache.
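The core placement idea, hot KV tokens in the fastest memory tier with overflow spilling to larger, slower tiers, can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual mechanism; the tier names (HBM, DRAM, CXL), capacities, and the `TieredKVCache` class are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryTier:
    """One level of the memory hierarchy (names/capacities are illustrative)."""
    name: str
    capacity_tokens: int                      # KV tokens this tier can hold
    tokens: list = field(default_factory=list)

    def has_room(self) -> bool:
        return len(self.tokens) < self.capacity_tokens

class TieredKVCache:
    """Place each new KV token in the fastest tier that still has space."""
    def __init__(self, tiers):
        self.tiers = tiers                    # ordered fastest -> slowest

    def append(self, token_id) -> str:
        for tier in self.tiers:
            if tier.has_room():
                tier.tokens.append(token_id)
                return tier.name
        raise MemoryError("all tiers full; a real system would evict or page")

# Toy capacities chosen so the spill-over behavior is visible.
cache = TieredKVCache([
    MemoryTier("HBM", capacity_tokens=2),
    MemoryTier("DRAM", capacity_tokens=4),
    MemoryTier("CXL", capacity_tokens=8),
])

placements = [cache.append(t) for t in range(7)]
# First 2 tokens land in HBM, the next 4 in DRAM, the rest spill to CXL.
```

A real design would also route the attention computation to wherever each token resides, rather than moving all KV data back to the accelerator; this sketch only shows the capacity-driven placement side.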
Why It Matters
If it delivers on its goals, PAM could dramatically lower the cost and latency of serving models like GPT-4 and Claude to millions of users.