TokenSpeed hits 580 tps on Qwen3.5-397B, setting new speed record for agentic AI
LightSeek's open-source engine outperforms TensorRT-LLM with advanced memory and kernel optimizations
The LightSeek Foundation has released TokenSpeed, an open-source LLM inference engine (MIT license) that achieves a record 580 tokens per second running the Qwen3.5-397B-A17B model on GPUs. Qwen3.5 uses a hybrid attention mechanism that combines standard full attention layers with Gated Delta Network (GDN) linear attention layers, reducing computational complexity for long sequences while maintaining strong reasoning. TokenSpeed was designed from the ground up for agentic workloads, targeting performance comparable to TensorRT-LLM with the developer-friendliness of vLLM. Its native single-program, multiple-data (SPMD) architecture and static compilation enable significant acceleration for complex multi-step agent tasks.
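To see why a linear attention layer is cheaper on long sequences, here is a minimal sketch of one step of the gated delta rule as it is commonly written for Gated DeltaNet-style layers: S_t = alpha_t * S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T, with output o_t = S_t q_t. Whether Qwen3.5 uses exactly this form is an assumption, and the function name and scalar gates are illustrative only. The point is that the state S keeps a fixed size no matter how many tokens have been processed, whereas a full attention layer's KV cache grows with sequence length.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def gated_delta_step(
    S: Matrix, q: Vector, k: Vector, v: Vector, alpha: float, beta: float
) -> Tuple[Matrix, Vector]:
    """One recurrent step of a gated delta rule (illustrative, not TokenSpeed code).

    S has shape (d_v, d_k). alpha is a decay gate and beta a write strength,
    both taken as scalars here for simplicity.
    """
    d_v, d_k = len(v), len(k)
    # S k: what the current state would return for key k.
    Sk = [sum(S[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
    # S_new = alpha * (S - beta * (S k) k^T) + beta * v k^T
    S_new = [
        [alpha * (S[i][j] - beta * Sk[i] * k[j]) + beta * v[i] * k[j] for j in range(d_k)]
        for i in range(d_v)
    ]
    # Read-out for the current query.
    o = [sum(S_new[i][j] * q[j] for j in range(d_k)) for i in range(d_v)]
    return S_new, o


# Usage: writing a new value under the same key overwrites the old one,
# and the state stays 2x2 regardless of how many steps are run.
state = [[0.0, 0.0], [0.0, 0.0]]
state, out1 = gated_delta_step(state, [1.0, 0.0], [1.0, 0.0], [3.0, 4.0], 1.0, 1.0)
state, out2 = gated_delta_step(state, [1.0, 0.0], [1.0, 0.0], [5.0, 6.0], 1.0, 1.0)
```

This fixed-size recurrent state is what a serving engine has to snapshot and restore when it wants to reuse a cached prefix for such layers.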
TokenSpeed achieves its performance through systematic elimination of memory copies, advanced kernel fusions, and fully overlapped CPU-GPU execution that keeps the GPU saturated. For agentic scenarios that require multi-turn tool-calling with shared contexts, TokenSpeed provides full GDN-aware prefix caching, with management split between C++ (the logical cache, with radix-tree matching and Mamba slot lifecycle) and Python (physical tensor management, with copy-on-write and state snapshots). This hybrid prefix cache stores both KV cache pages and Mamba recurrent states, so requests that share a prefix can reuse them. The engine also supports Prefill-Decode disaggregation and unified state transfers, covering the complex serving patterns of real-world agentic applications.
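The idea of a hybrid prefix cache can be illustrated with a small page-granular prefix tree. This is a simplified sketch, not TokenSpeed's implementation: the class and field names, the page size, and the integer IDs standing in for KV pages and recurrent-state slots are all hypothetical. It also assumes that a prefix is only reusable up to a point where a recurrent-state snapshot was saved, since a linear attention layer cannot resume from KV pages alone.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

PAGE = 4  # tokens per cache page (hypothetical value)


@dataclass
class Node:
    kv_page: Optional[int] = None          # ID of the KV cache page for this span
    state_snapshot: Optional[int] = None   # ID of a saved recurrent-state slot, if any
    children: Dict[Tuple[int, ...], "Node"] = field(default_factory=dict)


class HybridPrefixCache:
    """Prefix tree keyed by full pages of token IDs (illustrative only)."""

    def __init__(self) -> None:
        self.root = Node()

    def insert(self, tokens: List[int], kv_pages: List[int], snapshots: Dict[int, int]) -> None:
        """Record pages for `tokens`; `snapshots` maps page index -> state slot ID."""
        node = self.root
        for i, page_id in enumerate(kv_pages):
            key = tuple(tokens[i * PAGE:(i + 1) * PAGE])
            if len(key) < PAGE:
                break  # only full pages are cached
            child = node.children.get(key)
            if child is None:
                child = Node(kv_page=page_id)
                node.children[key] = child
            if i in snapshots and child.state_snapshot is None:
                child.state_snapshot = snapshots[i]
            node = child

    def match(self, tokens: List[int]) -> Tuple[int, List[int], Optional[int]]:
        """Return (reusable token count, KV page IDs, state slot ID).

        The match is cut back to the deepest page that has a state snapshot.
        """
        node = self.root
        walked: List[int] = []
        best: Tuple[int, List[int], Optional[int]] = (0, [], None)
        for i in range(len(tokens) // PAGE):
            node = node.children.get(tuple(tokens[i * PAGE:(i + 1) * PAGE]))
            if node is None:
                break
            walked.append(node.kv_page)
            if node.state_snapshot is not None:
                best = ((i + 1) * PAGE, list(walked), node.state_snapshot)
        return best


# Usage: a 12-token conversation is cached with a state snapshot after page 1.
cache = HybridPrefixCache()
cache.insert(list(range(12)), kv_pages=[10, 11, 12], snapshots={1: 7})
follow_up = cache.match(list(range(8)) + [99, 98, 97, 96])
diverged = cache.match([0, 1, 2, 3, 50, 51, 52, 53])
```

In this sketch a follow-up turn that shares the first eight tokens reuses two KV pages and one state slot, while a request that diverges before any snapshot reuses nothing, even though its first page matches.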
- TokenSpeed achieves 580 tokens per second on Qwen3.5-397B-A17B, a new speed record for agentic inference on GPUs
- Optimizations include memory copy elimination, kernel fusion, and full CPU-GPU overlap to keep the GPU saturated
- Hybrid prefix caching with Mamba state support enables efficient reuse of shared contexts in multi-turn tool-calling workloads
Why It Matters
Inference at this speed could make real-time, low-latency agentic AI practical, which is critical for complex multi-step tool-calling and enterprise AI agents.