Developer Tools

Large model inference container – latest capabilities and performance enhancements

New caching system reuses KV caches for repeated text, slashing costs for models with up to 10M-token contexts.

Deep Dive

AWS has released major updates to its Large Model Inference (LMI) container, headlined by the integration of LMCache, a technology designed to tackle the soaring costs of running modern, long-context LLMs. As frontier models now support up to 10 million tokens for complex RAG and coding agent tasks, processing repetitive documents across multiple queries has become a major performance and cost bottleneck. LMCache addresses this by intelligently caching the Key-Value (KV) activations for frequently reused text spans—not just prefixes—allowing these precomputed states to be shared across inference engines and user sessions. This transforms the economics of long-context workloads by reducing redundant computation.
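To make the chunk-level reuse concrete, here is a minimal Python sketch, not LMCache's actual implementation: it hashes fixed-size token chunks and reuses any previously computed KV segment, which is what lets repeated spans in the middle of a prompt, shared across queries and sessions, skip prefill. `compute_kv` is a hypothetical stand-in for the inference engine's prefill step, and the sketch ignores the positional dependence of real KV states that production systems must handle.

```python
import hashlib

CHUNK_SIZE = 256  # tokens per cached chunk (illustrative; real chunk sizes are configurable)

class ChunkKVCache:
    """Toy chunk-level KV cache: keys precomputed KV states by content hash so
    any repeated token span, not just a shared prefix, can be served from cache."""

    def __init__(self):
        self.store = {}  # chunk hash -> precomputed KV segment

    @staticmethod
    def _key(chunk):
        return hashlib.sha256(repr(chunk).encode()).hexdigest()

    def lookup_or_compute(self, tokens, compute_kv):
        """Split `tokens` into fixed-size chunks; reuse a cached KV segment on a
        hit, and run prefill (`compute_kv`, a stand-in for the engine) on a miss."""
        kv_segments, hits, misses = [], 0, 0
        for i in range(0, len(tokens), CHUNK_SIZE):
            chunk = tuple(tokens[i:i + CHUNK_SIZE])
            key = self._key(chunk)
            if key in self.store:
                hits += 1                         # repeated span: skip prefill
            else:
                self.store[key] = compute_kv(chunk)
                misses += 1
            kv_segments.append(self.store[key])
        return kv_segments, hits, misses
```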

The system operates at a chunk level, automatically identifying 'hot spots' of repeated content and storing their KV caches across GPU memory, CPU RAM, or NVMe storage. AWS has streamlined deployment with a low-code interface for automatic LMCache configuration. In benchmarks using a Qwen model on a p4de.24xlarge instance with 460,000 tokens of repeated documents, LMCache with CPU offloading cut total request latency from 52.978s to 24.274s (a 2.18x speedup) and delivered significantly faster Time to First Token (TTFT). For production, AWS recommends pairing LMCache with session-based sticky routing on Amazon SageMaker to maximize cache hits (see the sketch after the key points below), enabling organizations to deploy million-token contexts cost-effectively for the first time.
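The summary does not detail the low-code interface, but conceptually, enabling the cache comes down to container environment settings on the model. Below is a hedged boto3 sketch: `create_model` is a real SageMaker API, while the image URI, model ID, and every environment key (`OPTION_ENABLE_LMCACHE` and the `LMCACHE_*` values) are illustrative assumptions; consult the LMI and LMCache documentation for the supported names.

```python
import boto3

sm = boto3.client("sagemaker")

# Hedged sketch: registering an LMI container with LMCache-style CPU offloading.
# Image URI, model ID, and environment keys below are assumptions for illustration.
sm.create_model(
    ModelName="qwen-long-context-lmcache",
    ExecutionRoleArn="arn:aws:iam::123456789012:role/SageMakerRole",
    PrimaryContainer={
        "Image": "<account>.dkr.ecr.<region>.amazonaws.com/djl-inference:latest-lmi",
        "Environment": {
            "HF_MODEL_ID": "Qwen/Qwen2.5-14B-Instruct",  # hypothetical model choice
            "OPTION_ENABLE_LMCACHE": "true",             # illustrative toggle
            "LMCACHE_CHUNK_SIZE": "256",                 # tokens per cached chunk
            "LMCACHE_LOCAL_CPU": "True",                 # offload KV chunks to CPU RAM
            "LMCACHE_MAX_LOCAL_CPU_SIZE": "200",         # GiB of CPU cache (illustrative)
        },
    },
)
```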

Key Points
  • LMCache caches KV activations for repeated text chunks, not just prefixes, enabling reuse across queries and engines.
  • Benchmarks show total request latency dropping from 52.978s to 24.274s (a 2.18x speedup) on a 460K-token workload with a Qwen model and CPU offloading.
  • Enables efficient deployment of models with up to 10M-token contexts for RAG systems and coding agents by offloading cache to CPU/NVMe.
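To illustrate the sticky-routing recommendation from the deep dive, here is a hedged boto3 sketch of SageMaker's stateful-session routing: the first `invoke_endpoint` call opens a session, and subsequent calls carrying the returned session ID are pinned to the same instance, keeping its LMCache contents hot. The `SessionId`/`NewSessionId` names follow SageMaker's stateful sessions feature but should be verified against current boto3 documentation; the endpoint name and payloads are hypothetical.

```python
import boto3

smr = boto3.client("sagemaker-runtime")

# First call: ask SageMaker to create a sticky session for this client.
first = smr.invoke_endpoint(
    EndpointName="qwen-long-context-lmcache",
    ContentType="application/json",
    Body=b'{"inputs": "<long shared document> ... question 1"}',
    SessionId="NEW_SESSION",          # request a new stateful session
)
session_id = first["NewSessionId"]    # session handle returned by the service

# Follow-up call: reuse the session ID so the request lands on the same
# instance, where the document's KV chunks are already cached.
follow_up = smr.invoke_endpoint(
    EndpointName="qwen-long-context-lmcache",
    ContentType="application/json",
    Body=b'{"inputs": "<long shared document> ... question 2"}',
    SessionId=session_id,             # sticky routing -> LMCache hits
)
```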

Why It Matters

Dramatically reduces the cost and latency of running long-context AI applications, making advanced RAG and code-agent systems commercially viable.