Cactus: Accelerating Auto-Regressive Decoding with Constrained Acceptance Speculative Sampling
New speculative sampling algorithm guarantees controlled divergence, boosting acceptance rates without degrading output quality.
Researchers Yongchang Hao and Lili Mou have introduced Cactus (Constrained Acceptance Speculative Sampling), a novel algorithm designed to speed up text generation in large language models (LLMs). The method builds on speculative sampling (SpS), a popular technique in which a smaller, faster "draft" model proposes tokens that are then verified by a larger, more powerful target model. Cactus addresses a key limitation of existing approaches: standard SpS is overly strict in which draft tokens it accepts, while typical acceptance sampling (TAS) can distort the target model's output distribution. Cactus instead casts acceptance as a formal constrained optimization problem, guaranteeing a controlled, acceptable divergence and thereby accepting more draft tokens without compromising the quality encoded by the verifier LLM.
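For context, the baseline that Cactus relaxes is simple to state: accept a draft token with probability min(1, p(x)/q(x)) and, on rejection, resample from the normalized residual. The sketch below is a minimal illustration of that standard SpS accept/resample step, not the paper's code or the Cactus rule; the function name, toy vocabulary, and distributions are made up for illustration.

```python
import numpy as np

def speculative_accept(p_target, q_draft, token, rng):
    # Standard SpS test: accept the draft token with probability
    # min(1, p(x) / q(x)); this keeps the overall output distribution
    # exactly equal to the target model's.
    accept_prob = min(1.0, p_target[token] / q_draft[token])
    if rng.random() < accept_prob:
        return token, True
    # On rejection, resample from the normalized residual max(p - q, 0),
    # which corrects the bias introduced by the draft distribution.
    residual = np.maximum(p_target - q_draft, 0.0)
    residual /= residual.sum()
    return rng.choice(len(p_target), p=residual), False

# Toy 4-token vocabulary: target distribution p, draft distribution q.
rng = np.random.default_rng(0)
p = np.array([0.5, 0.2, 0.2, 0.1])
q = np.array([0.25, 0.25, 0.25, 0.25])
draft_token = rng.choice(len(q), p=q)
token, accepted = speculative_accept(p, q, draft_token, rng)
print(f"draft={draft_token}, output={token}, accepted={accepted}")
```

Because the accept/resample pair exactly reproduces the target distribution, the only way to accept more tokens is to allow some divergence from it, which is precisely the budget Cactus controls.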
Empirical results across a range of benchmarks confirm Cactus's effectiveness, showing that it increases acceptance rates, and therefore decoding throughput, while maintaining output integrity. The paper, accepted at ICLR 2026 in its camera-ready form, provides a rigorous mathematical formulation that could shape future research on efficient inference. By optimizing the trade-off between speed and fidelity, Cactus represents a meaningful step toward making state-of-the-art models such as GPT-4o or Claude 3 cheaper and more responsive to operate in real-world applications.
- Formalizes speculative sampling as a constrained optimization problem, providing a rigorous framework for acceleration.
- Guarantees controlled divergence from the verifier LLM's distribution, accepting more tokens than strict SpS without the quality degradation of TAS (an illustrative sketch of this trade-off follows the list).
- Accepted at ICLR 2026, with empirical validation across benchmarks showing increased acceptance rates and maintained output quality.
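The exact acceptance rule Cactus derives from its constrained optimization is given in the paper. As rough intuition only, the hypothetical sketch below shows one simple way a "slack" factor could relax the strict SpS threshold to trade a bounded distortion for a higher acceptance rate; the slack parameter and the rule itself are illustrative assumptions, not the authors' method.

```python
def relaxed_accept_prob(p_target, q_draft, token, slack=1.2):
    # Hypothetical relaxation (NOT the Cactus rule): standard SpS uses
    # min(1, p/q); scaling by slack > 1 accepts more draft tokens while
    # distorting the output distribution by a bounded amount. Cactus
    # instead derives an acceptance function that is optimal under an
    # explicit divergence constraint.
    return min(1.0, slack * p_target[token] / q_draft[token])
```

Setting slack=1.0 recovers standard SpS exactly; larger values trade fidelity for throughput, which is the axis Cactus optimizes in a principled way.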
Why It Matters
Enables faster, cheaper LLM inference for applications like chatbots and coding assistants without sacrificing response quality.