Research & Papers

Blink: CPU-Free LLM Inference by Delegating the Serving Stack to GPU and SmartNIC

New architecture removes host CPU from AI inference path, eliminating performance degradation from CPU interference.

Deep Dive

A team of researchers from KTH Royal Institute of Technology and other institutions has published a paper on arXiv introducing Blink, an end-to-end serving architecture designed to remove the host CPU from the critical path of Large Language Model (LLM) inference. Current systems like TensorRT-LLM and vLLM rely on the CPU for request handling, scheduling, and KV-cache management, which leaves performance vulnerable to CPU interference and forces operators to reserve CPU capacity that then sits idle. Blink fundamentally redesigns this stack by delegating those responsibilities to specialized hardware: a SmartNIC receives incoming requests and uses RDMA to place data directly into GPU memory, while a persistent GPU kernel autonomously manages batching, scheduling, and the KV-cache.
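To make the "persistent GPU kernel" idea concrete, here is a minimal CUDA sketch, not the paper's code: a kernel launched once that polls a request ring buffer in GPU-visible memory and admits work without any per-request CPU involvement. In a Blink-like design the producer would be the SmartNIC writing descriptors via GPUDirect RDMA; below, the host thread stands in for the NIC using mapped (zero-copy) pinned memory. All names (Request, RingBuffer, serve_loop) are illustrative assumptions.

```cuda
// Minimal sketch of a persistent serving kernel -- NOT the paper's artifact.
// Assumption: the SmartNIC writes fixed-size request descriptors into a ring
// buffer in GPU-visible memory; here the host simulates that producer.
#include <atomic>
#include <cstdio>
#include <cuda_runtime.h>

struct Request {
    int id;          // request identifier
    int num_tokens;  // prompt length (placeholder payload)
};

constexpr unsigned RING_SIZE = 64;

struct RingBuffer {
    Request slots[RING_SIZE];
    volatile unsigned head;   // advanced by the producer (NIC; host here)
    volatile unsigned tail;   // advanced by the consumer (GPU kernel)
    volatile int shutdown;    // host sets this to end the serving loop
};

// Launched once and never relaunched: the kernel itself polls for work,
// so no CPU sits on the per-request critical path.
__global__ void serve_loop(RingBuffer* rb) {
    if (threadIdx.x != 0 || blockIdx.x != 0) return;  // single consumer
    while (true) {
        if (rb->shutdown && rb->tail == rb->head) break;  // drain, then exit
        if (rb->tail != rb->head) {
            Request r = rb->slots[rb->tail % RING_SIZE];
            // A real system would admit r into the running batch, allocate
            // KV-cache pages, and issue attention/FFN work; we just log it.
            printf("GPU admitted request %d (%d prompt tokens)\n",
                   r.id, r.num_tokens);
            __threadfence();   // finish reading the slot before freeing it
            rb->tail = rb->tail + 1;
        }
    }
}

int main() {
    RingBuffer* rb = nullptr;
    cudaHostAlloc((void**)&rb, sizeof(RingBuffer), cudaHostAllocMapped);
    rb->head = 0; rb->tail = 0; rb->shutdown = 0;

    RingBuffer* d_rb = nullptr;   // device-side alias of the same memory
    cudaHostGetDevicePointer((void**)&d_rb, rb, 0);

    serve_loop<<<1, 1>>>(d_rb);   // one launch; the kernel persists

    for (int i = 0; i < 4; ++i) { // host plays the SmartNIC's role
        rb->slots[rb->head % RING_SIZE] = Request{i, 128 + i};
        std::atomic_thread_fence(std::memory_order_release); // slot, then head
        rb->head = rb->head + 1;
    }
    rb->shutdown = 1;
    cudaDeviceSynchronize();      // returns once the kernel drains and exits
    cudaFreeHost(rb);
    return 0;
}
```

Compiled with, say, `nvcc -O2 blink_sketch.cu`, this prints the four admitted requests from the GPU side. The point it mirrors is Blink's design choice: after the single launch, request admission and scheduling proceed with zero CPU involvement, so CPU interference cannot reach the serving loop.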

This CPU-free approach yields dramatic performance improvements. In evaluations against leading frameworks, Blink reduced the P99 Time To First Token (TTFT) by up to 8.47x and the P99 Time Per Output Token (TPOT) by up to 3.40x. It also improved decode throughput by 2.1x and slashed energy consumption per token by 48.6%. Crucially, under simulated CPU interference, existing systems degraded by up to two orders of magnitude, while Blink's performance remained stable. This resilience allows for aggressive application colocation in datacenters without sacrificing LLM service quality.
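For context on the metrics, using the standard serving definitions (not taken from the paper): TTFT is the delay from request arrival to the first generated token, and TPOT is the average gap between subsequent output tokens,

$$\mathrm{TTFT} = t_{\text{first}} - t_{\text{arrival}}, \qquad \mathrm{TPOT} = \frac{t_{\text{last}} - t_{\text{first}}}{N_{\text{out}} - 1},$$

where $N_{\text{out}}$ is the number of generated tokens. P99 denotes the 99th-percentile value across requests, i.e. tail latency, which is exactly where CPU interference tends to show up.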

The architecture represents a significant shift towards more efficient and predictable AI infrastructure. By treating the GPU and SmartNIC as a cohesive, self-orchestrating unit, Blink addresses a core bottleneck in modern AI serving. The research highlights the potential of hardware/software co-design to unlock new levels of efficiency, which is critical as LLM inference becomes a dominant datacenter workload. The paper's findings could influence the next generation of inference servers and cloud AI services.

Key Points
  • Eliminates the CPU bottleneck by offloading orchestration to a SmartNIC and a persistent GPU kernel, cutting P99 first-token latency by up to 8.47x.
  • Reduces energy consumption per token by 48.6% and maintains stable performance under CPU interference where other systems degrade severely.
  • Outperforms frameworks like vLLM and TensorRT-LLM, improving decode throughput by 2.1x and enabling more efficient server colocation.

Why It Matters

Enables more predictable, efficient, and scalable AI inference in datacenters, directly reducing cloud costs and improving service reliability.