TokenSpeed achieves 580 tokens per second on Qwen3.5-397B-A17B, a new speed record for agentic inference on GPUs?

TokenSpeed achieves 580 tokens per second on Qwen3.5-397B-A17B, a new speed record for agentic inference on GPUs

Optimizations include memory copy elimination, kernel fusion, and full CPU-GPU overlap to keep the GPU saturated?

Optimizations include memory copy elimination, kernel fusion, and full CPU-GPU overlap to keep the GPU saturated

Hybrid prefix caching with Mamba state support enables efficient reuse of shared contexts in multi-turn tool-calling workloads?

Hybrid prefix caching with Mamba state support enables efficient reuse of shared contexts in multi-turn tool-calling workloads

Developer Tools

TokenSpeed hits 580 tps on Qwen3.5-397B, setting new speed record for agentic AI

PyTorch Blog May 27, 2026

⚡LightSeek's open-source engine outperforms TensorRT-LLM with advanced memory and kernel optimizations

Deep Dive

The LightSeek Foundation has released TokenSpeed, an open-source LLM inference engine (MIT license) that achieves a record 580 tokens per second running the Qwen3.5-397B-A17B model on GPUs. Qwen3.5 uses a hybrid attention mechanism combining standard full attention layers with Gated Delta Network (GDN) linear attention layers, reducing computational complexity for long sequences while maintaining strong reasoning. TokenSpeed was designed from the ground up for agentic workloads, targeting performance comparable to TensorRT-LLM with the developer-friendliness of vLLM. Its native SPMD architecture and static compilation enable significant acceleration for complex multi-step agent tasks.

TokenSpeed achieves its extreme performance through systematic elimination of memory copies, advanced kernel fusions, and fully overlapped CPU-GPU execution that keeps the GPU saturated at all times. For agentic scenarios requiring multi-turn tool-calling with shared contexts, TokenSpeed provides full GDN-aware prefix caching—splitting management across C++ (logical cache with radix-tree matching and Mamba slot lifecycle) and Python (physical tensor management with copy-on-write and state snapshots). This hybrid prefix cache stores both KV cache pages and Mamba recurrent states, enabling efficient reuse across requests. The engine also supports Prefill-Decode disaggregation and unified state transfers, handling the complex serving patterns of real-world agentic applications.

Key Points

TokenSpeed achieves 580 tokens per second on Qwen3.5-397B-A17B, a new speed record for agentic inference on GPUs
Optimizations include memory copy elimination, kernel fusion, and full CPU-GPU overlap to keep the GPU saturated
Hybrid prefix caching with Mamba state support enables efficient reuse of shared contexts in multi-turn tool-calling workloads

Why It Matters

Enables real-time agentic AI with ultra-low latency, critical for complex multi-step tool-calling and enterprise AI agents.

Read Original Article

TokenSpeed hits 580 tps on Qwen3.5-397B, setting new speed record for agentic AI

Why It Matters

Related Articles

🚀 Stay Ahead in AI