Developer Tools

P-EAGLE: Faster LLM inference with Parallel Speculative Decoding in vLLM

New method generates multiple draft tokens in one pass, breaking a key bottleneck in fast inference.

Deep Dive

A new optimization called P-EAGLE (Parallel EAGLE) has been integrated into the popular vLLM inference engine, promising significant speedups for large language model (LLM) serving. The method enhances the existing EAGLE speculative decoding framework, which uses a smaller "drafter" model to predict several tokens ahead of the main LLM to reduce latency. However, EAGLE's drafting process was autoregressive: it needed one sequential forward pass per draft token, creating a hidden bottleneck. P-EAGLE solves this by restructuring the drafter to generate all K draft tokens in a single, parallel forward pass, dramatically reducing overhead.
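
As a rough illustration (not vLLM or P-EAGLE internals), the difference between the two drafting styles looks like this; `drafter_step` and `drafter_parallel` are hypothetical stand-ins for the drafter model's forward pass:

```python
def draft_autoregressive(drafter_step, context, k):
    """Vanilla EAGLE: k sequential forward passes, one per draft token."""
    tokens = []
    for _ in range(k):
        next_token = drafter_step(context + tokens)  # one forward pass per token
        tokens.append(next_token)
    return tokens

def draft_parallel(drafter_parallel, context, k):
    """P-EAGLE: a single forward pass emits all k draft tokens at once."""
    return drafter_parallel(context, k)  # one forward pass total

# Toy stand-ins so the sketch runs end to end (a real drafter is a small transformer).
def drafter_step(seq):
    return len(seq)  # pretend next-token prediction

def drafter_parallel(context, k):
    return [len(context) + i for i in range(k)]

context = [101, 102, 103]
assert draft_autoregressive(drafter_step, context, 4) == draft_parallel(drafter_parallel, context, 4)
```

The latency win comes from collapsing k model invocations into one, so drafting cost stops growing linearly with speculation depth.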

This architectural shift is now available in vLLM starting from version 0.16.0. On an NVIDIA B200 GPU, P-EAGLE achieves speedups of 1.05x to 1.69x over vanilla EAGLE-3 on models such as GPT-OSS 20B across standard benchmark workloads. The integration is user-friendly: enabling it requires just a single configuration change (`parallel_drafting: true`) in the speculative config. Pre-trained P-EAGLE drafter heads are already hosted on HuggingFace for models including GPT-OSS 120B, GPT-OSS 20B, and Qwen3-Coder 30B, allowing teams to deploy the acceleration immediately.
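
Assuming vLLM's existing `speculative_config` interface, enabling it might look like the sketch below. Only the `parallel_drafting` flag is confirmed by the release; the surrounding keys follow vLLM's current speculative decoding conventions, and the drafter checkpoint path is a placeholder:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="openai/gpt-oss-20b",
    speculative_config={
        "method": "eagle3",                       # base speculative decoding method
        "model": "your-org/p-eagle-gpt-oss-20b",  # hypothetical drafter head from HuggingFace
        "num_speculative_tokens": 4,              # K draft tokens per step
        "parallel_drafting": True,                # the one-flag switch for P-EAGLE
    },
)

outputs = llm.generate(
    ["Explain speculative decoding in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```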

The core innovation lies in P-EAGLE's two-step drafting process. After the target model processes the prompt, P-EAGLE captures its internal hidden states. The drafter then constructs parallel inputs for each draft position, using a combination of real token embeddings and these hidden states. For positions where future token data is unknown, it employs learned placeholder parameters. This allows the transformer layers to predict multiple future tokens simultaneously, breaking the linear scaling constraint of autoregressive drafting and unlocking higher speculation depths for greater speed.
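
A minimal PyTorch-style sketch of that input construction follows; the module names, shapes, and fusion step are illustrative assumptions, not P-EAGLE's actual code:

```python
import torch
import torch.nn as nn

class ParallelDrafterInputs(nn.Module):
    """Builds K drafter inputs at once, per the two-step process described above."""

    def __init__(self, hidden_size: int, k: int):
        super().__init__()
        # Learned placeholders standing in for the K-1 not-yet-known future tokens.
        self.placeholders = nn.Parameter(torch.zeros(k - 1, hidden_size))
        # Hypothetical fusion of a real token embedding with a target hidden state.
        self.fuse = nn.Linear(2 * hidden_size, hidden_size)

    def forward(self, last_embed: torch.Tensor, target_hidden: torch.Tensor):
        # last_embed:    (hidden,) embedding of the last accepted token
        # target_hidden: (hidden,) hidden state captured from the target model
        first = self.fuse(torch.cat([last_embed, target_hidden]))  # position 0: real data
        rest = self.placeholders                                   # positions 1..K-1
        return torch.cat([first.unsqueeze(0), rest])               # (K, hidden)

k, hidden = 4, 8
builder = ParallelDrafterInputs(hidden, k)
inputs = builder(torch.randn(hidden), torch.randn(hidden))
print(inputs.shape)  # torch.Size([4, 8]): one row per draft position, one forward pass
```

The resulting (K, hidden) batch can flow through the drafter's transformer layers in a single pass, which is what removes the per-token sequential dependency.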

Key Points
  • Generates all draft tokens in one parallel pass, removing EAGLE's sequential bottleneck.
  • Achieves up to 1.69x speedup over EAGLE-3 on GPT-OSS 20B using an NVIDIA B200.
  • Integrated into vLLM v0.16.0; enable with a config flag and use pre-trained heads on HuggingFace.

Why It Matters

Lowers inference cost and latency for production AI applications, making advanced LLMs more efficient to serve at scale.