SimpleTool: Parallel Decoding for Real-Time LLM Function Calling
New method achieves 3-6x speedup for AI agents, hitting 61.2ms latency on consumer GPUs.
A research team led by Xiaoxin Shi has introduced SimpleTool, a novel parallel decoding method that dramatically accelerates LLM-based function calling for intelligent agents. The breakthrough addresses a fundamental bottleneck in real-time AI applications—autoregressive decoding latency—which has limited deployment in embodied intelligence, game AI, and interactive avatars requiring 10 Hz control frequencies. SimpleTool exploits two key observations about function calling: structured outputs contain substantial token redundancy (delimiters, parameter names), and arguments exhibit weak causal dependencies. By designing special tokens that serve dual roles—compressing low-entropy tokens while acting as mode selectors—the system enables independent parallel generation of function names and arguments.
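The core idea can be illustrated with a toy sketch. This is not SimpleTool's actual API; function names, the schema format, and the thread-based parallelism here are all illustrative stand-ins for the paper's special-token mechanism. The point it demonstrates: the structural tokens of a function call are fixed by a template, so only the variable slots need decoding, and under the weak-dependency assumption those slots can be filled independently rather than left to right.

```python
# Hypothetical sketch of the parallel-decoding idea described above.
# `decode_slot`, the schema dict, and the canned values are illustrative,
# not part of SimpleTool.
from concurrent.futures import ThreadPoolExecutor

def decode_slot(slot_name: str) -> str:
    """Stand-in for one independent decoding pass; a real system would
    run the model per slot (or batch all slots in one forward pass)."""
    canned = {"name": "get_weather", "city": "Paris", "unit": "celsius"}
    return canned[slot_name]

def parallel_function_call(schema: dict) -> str:
    # Structural tokens ('{', '"city":', ...) are never generated token
    # by token: they come from the template. This is the "token
    # redundancy" in structured outputs that the article mentions.
    slots = ["name"] + schema["args"]
    with ThreadPoolExecutor() as pool:
        values = dict(zip(slots, pool.map(decode_slot, slots)))
    args = ", ".join(f'"{a}": "{values[a]}"' for a in schema["args"])
    return f'{values["name"]}({{{args}}})'

print(parallel_function_call({"args": ["city", "unit"]}))
# get_weather({"city": "Paris", "unit": "celsius"})
```

Because no slot waits on any other, end-to-end latency is bounded by the slowest single slot rather than the sum of all of them, which is where the reported 3-6x speedup comes from.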
Experiments across five benchmarks using Qwen-series models (0.5B-14B) demonstrate substantial speed improvements while maintaining competitive or improved accuracy. The method achieves 3-6x end-to-end speedup (up to 9.6x) with only +8.2% parallelization overhead. Notably, ST-Qwen-0.5B outperforms Google's FunctionGemma in both accuracy and latency consistency on the Mobile Actions benchmark. With quantization on consumer-grade GPUs, SimpleTool achieves 61.2ms P50 latency, enabling 16 Hz real-time control at the 4B model scale. The result narrows the gap between LLM function calling and latency-critical real-world deployment, making responsive AI agents practical at small model scales.
- Achieves 3-6x end-to-end speedup (up to 9.6x) with only +8.2% parallelization overhead
- Enables 16 Hz real-time control at 4B model scale with 61.2ms P50 latency on consumer GPUs
- ST-Qwen-0.5B outperforms Google's FunctionGemma in accuracy and latency consistency on Mobile Actions benchmark
Why It Matters
Makes AI agents practical for real-time applications like robotics, gaming, and interactive avatars by solving the latency bottleneck.