Research & Papers

ConFu: Contemplate the Future for Better Speculative Sampling

New speculative decoding framework uses 'contemplate tokens' to anticipate generation direction, beating EAGLE-3.

Deep Dive

A research team from UCLA and USC has introduced ConFu (Contemplate the Future), a speculative decoding framework that accelerates large language model inference. In traditional speculative decoding, a lightweight draft model conditions only on the current text prefix, which leads to error accumulation and prediction drift; ConFu instead lets the draft model anticipate the future direction of generation. The system achieves this through three key innovations: contemplate tokens and soft prompts that provide future-oriented signals from the target model at negligible cost; a dynamic contemplate token mechanism that uses a Mixture of Experts (MoE) for context-aware future prediction; and a novel training framework with anchor token sampling and future prediction replication.
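For context on the verify step that any such framework builds on: in standard speculative sampling (the scheme EAGLE-style methods and, per this summary, ConFu accelerate), the target model checks each drafted token and accepts it with probability min(1, p_target/p_draft), resampling from the residual distribution on the first rejection. A minimal toy-vocabulary sketch of that rule, not ConFu-specific and with illustrative names, looks like this:

```python
import numpy as np

rng = np.random.default_rng(0)

def speculative_accept(draft_probs, target_probs, draft_tokens):
    """Standard speculative-sampling verification: accept each drafted
    token with probability min(1, p_target / p_draft); on the first
    rejection, resample from the normalized residual distribution.
    draft_probs, target_probs: (n_draft, vocab) arrays of step-wise
    distributions; draft_tokens: tokens the draft model proposed."""
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i][tok], draft_probs[i][tok]
        if rng.random() < min(1.0, p / q):
            accepted.append(tok)  # token passes verification
        else:
            # Resample from the residual max(0, p_target - p_draft),
            # which keeps the output distribution equal to the target's.
            residual = np.maximum(target_probs[i] - draft_probs[i], 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(residual), p=residual)))
            break  # discard the rest of the draft after a rejection
    return accepted
```

The speedup of any method in this family comes from raising the acceptance rate: the closer the draft distribution tracks the target's, the more tokens survive verification per target-model forward pass, which is exactly the quantity ConFu reports improving.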

Experiments demonstrate that ConFu consistently outperforms the current state-of-the-art EAGLE-3 framework, improving token acceptance rates and overall generation speed by 8-11% across various downstream tasks when tested with Llama-3 3B and 8B models. The paper, accepted at the ICLR 2026 workshop on Latent & Implicit Thinking, represents the first work to bridge speculative decoding with continuous reasoning tokens. This approach offers a new direction for making LLM inference more efficient without sacrificing output quality, potentially reducing computational costs and latency for real-time applications.

Key Points
  • Improves token acceptance rates by 8-11% over EAGLE-3 on Llama-3 3B/8B models
  • Uses 'contemplate tokens' and soft prompts to predict future generation direction
  • First framework to bridge speculative decoding with continuous reasoning tokens

Why It Matters

Enables faster, cheaper LLM inference for real-time applications, reducing computational costs significantly.