TorchSpec: Speculative Decoding Training at Scale
Eliminates massive disk storage needs by streaming hidden states via RDMA, enabling scalable speculative decoding training.
TorchSpec is a novel framework that fundamentally rethinks how draft models are trained for speculative decoding, a critical technique for accelerating large language model inference. The system addresses the growing bottleneck of handling massive hidden states from target models like Kimi K2.5, where a single 128K-token training sample requires approximately 7GB of hidden state data. Traditional approaches either consume enormous disk storage and suffer severe I/O pressure, or require co-located training that ties draft model parallelism to the target model's configuration and creates significant GPU memory constraints.
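For a sense of scale, a rough back-of-the-envelope calculation shows how a single long sample reaches that size. The hidden dimension (7168), number of captured layers (3), and bf16 precision below are illustrative assumptions, not confirmed Kimi K2.5 or TorchSpec settings:

```python
# Back-of-the-envelope estimate of hidden-state volume for one long sample.
SEQ_LEN = 128 * 1024        # 128K-token training sample
HIDDEN_SIZE = 7168          # assumed target-model hidden dimension (illustrative)
NUM_CAPTURED_LAYERS = 3     # assumed number of layers whose hidden states are kept
BYTES_PER_ELEM = 2          # bf16

total_bytes = SEQ_LEN * HIDDEN_SIZE * NUM_CAPTURED_LAYERS * BYTES_PER_ELEM
print(f"{total_bytes / 1e9:.1f} GB")  # ~5.6 GB, the same order as the ~7 GB figure cited above
```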
TorchSpec's breakthrough lies in its disaggregated architecture that separates the inference system generating hidden states from the training system consuming them. Instead of writing to disk, hidden states stream directly from inference engine groups to training worker groups through a central Mooncake store over RDMA (Remote Direct Memory Access) or TCP. This design eliminates disk storage requirements while allowing inference and training resources to scale independently, overcoming the rigid sharding and memory-pressure limitations of co-located approaches.
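As a rough illustration of this streaming pattern, the sketch below shows a producer/consumer flow through a key-value store. The `HiddenStateStore` class, its `put`/`get` methods, and the key scheme are hypothetical placeholders standing in for the Mooncake store interface, not its actual API:

```python
import torch

class HiddenStateStore:
    """Hypothetical key-value facade standing in for the RDMA/TCP-backed store."""
    def __init__(self):
        self._buf = {}

    def put(self, key: str, tensor: torch.Tensor) -> None:
        # Stand-in for an RDMA/TCP write into the central store.
        self._buf[key] = tensor.contiguous().cpu()

    def get(self, key: str) -> torch.Tensor:
        # Stand-in for an RDMA/TCP read; the entry is consumed once fetched.
        return self._buf.pop(key)

store = HiddenStateStore()

# Inference side: stream a sample's hidden states to the store instead of disk.
def produce(sample_id: int, hidden_states: torch.Tensor) -> None:
    store.put(f"hidden/{sample_id}", hidden_states)

# Training side: a draft-model worker pulls exactly the hidden states it needs,
# independent of how the target model is sharded.
def consume(sample_id: int) -> torch.Tensor:
    return store.get(f"hidden/{sample_id}")

produce(0, torch.randn(8, 7168).to(torch.bfloat16))
print(consume(0).shape)  # torch.Size([8, 7168])
```

Because the store sits between the two groups, neither side needs to know the other's parallelism layout, which is what lets inference and training scale independently.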
The framework has demonstrated impressive results in production-scale training. Using TorchSpec, researchers trained a Kimi K2.5 EAGLE-3 draft model in 1,500 H200 GPU hours, scaling to 600,000 training samples comprising 6 billion tokens. The resulting draft model shows strong benchmark performance and, when deployed with a lookahead of 3 tokens, delivers throughput improvements of over 60% at batch size 1, 30% at batch size 8, and 26% at batch size 16.
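To see why a 3-token lookahead translates into throughput, the minimal sketch below walks through one draft-then-verify round under greedy acceptance. The function names and stub models are illustrative and do not reflect the exact verification scheme used in the Kimi K2.5 deployment:

```python
def speculative_step(draft_propose, target_verify, context, lookahead=3):
    """One draft-then-verify round: returns the tokens emitted by this round."""
    draft = draft_propose(context, lookahead)   # k cheap tokens from the draft model
    target = target_verify(context, draft)      # target's tokens at those positions, one forward pass
    emitted = []
    for d, t in zip(draft, target):
        emitted.append(t)                       # always emit the token the target agrees with
        if d != t:                              # first mismatch ends the round
            break
    return emitted                              # between 1 and k tokens per target pass


# Toy usage with stub models: the draft matches the target on 2 of its 3
# proposed tokens, so this round emits 3 tokens from a single target pass.
emitted = speculative_step(
    draft_propose=lambda ctx, k: [4, 5, 9][:k],
    target_verify=lambda ctx, d: [4, 5, 6][:len(d)],
    context=[1, 2, 3],
)
print(emitted)  # [4, 5, 6]
```

The better the draft model tracks the target, the more of those lookahead tokens are accepted per round, which is why draft-model quality maps directly onto the throughput gains reported above.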
This represents a significant advancement for organizations deploying frontier models with hundreds of billions of parameters and million-token contexts. By solving the hidden state transfer bottleneck, TorchSpec enables more efficient training of high-performance draft models that can dramatically accelerate inference for models like Kimi K2.5, GLM 5, and Qwen 3.5, making large-scale LLM deployment more practical and cost-effective.
- Eliminates 7GB-per-sample disk storage by streaming hidden states via RDMA/TCP through Mooncake store
- Trained Kimi K2.5 EAGLE-3 draft model with 6B tokens using 1,500 H200 GPU hours
- Achieves +60% throughput at batch size 1 and +30% at batch size 8 with 3-token lookahead
Why It Matters
Enables practical deployment of frontier LLMs by dramatically reducing inference costs and latency through efficient speculative decoding training.