Research & Papers

SimpleTool: Parallel Decoding for Real-Time LLM Function Calling

A new parallel decoding method achieves a 3-6x speedup for AI agents, reaching 61.2ms P50 latency on consumer GPUs.

Deep Dive

A research team led by Xiaoxin Shi has introduced SimpleTool, a parallel decoding method that dramatically accelerates LLM-based function calling for intelligent agents. The work targets a fundamental bottleneck in real-time AI applications: autoregressive decoding latency, which has limited deployment in embodied intelligence, game AI, and interactive avatars that require control frequencies around 10 Hz. SimpleTool exploits two key observations about function calling: structured outputs contain substantial token redundancy (delimiters, parameter names), and arguments exhibit only weak causal dependencies on one another. By designing special tokens that serve dual roles, compressing spans of low-entropy tokens while acting as mode selectors, the system generates function names and arguments independently and in parallel.
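The idea can be sketched roughly as follows. This is an illustrative toy, not the paper's implementation: the function name, argument values, and `decode_field` stand-in are all hypothetical, and the JSON skeleton reassembled at the end plays the role of the low-entropy structure a special token would compress.

```python
import json
from concurrent.futures import ThreadPoolExecutor

# Stand-in for one constrained decoding pass. A real system would run the
# model once per field; the names and values here are made up for illustration.
CANNED = {"name": "set_speed", "speed": 0.5, "heading": 90}

def decode_field(prompt: str, field: str):
    # Because arguments depend only weakly on each other, each field can be
    # decoded without waiting for the previous one to finish.
    return field, CANNED[field]

def parallel_function_call(prompt: str, fields=("name", "speed", "heading")):
    # Fire off one decoding pass per field concurrently, then reassemble the
    # boilerplate JSON structure (delimiters, parameter names) deterministically.
    with ThreadPoolExecutor() as pool:
        decoded = dict(pool.map(lambda f: decode_field(prompt, f), fields))
    name = decoded.pop("name")
    return json.dumps({"name": name, "arguments": decoded})

print(parallel_function_call("move forward at half speed"))
# → {"name": "set_speed", "arguments": {"speed": 0.5, "heading": 90}}
```

The point of the sketch is the shape of the computation: only the field values need model passes, and those passes can overlap, while the structural tokens cost nothing at decode time.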

Experiments across five benchmarks using Qwen-series models (0.5B-14B) show substantial speed improvements while maintaining competitive or improved accuracy. The method achieves a 3-6x end-to-end speedup (up to 9.6x) with only +8.2% parallelization overhead. Notably, ST-Qwen-0.5B outperforms Google's FunctionGemma in both accuracy and latency consistency on the Mobile Actions benchmark. With quantization on consumer-grade GPUs, SimpleTool reaches 61.2ms P50 latency, enough for 16 Hz real-time control at the 4B model scale. The result narrows the gap between LLM function calling and latency-critical real-world deployment, making responsive LLM-driven agents practical.
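A quick sanity check on those numbers, using only the figures quoted in the summary: a control loop running at f Hz has a per-decision budget of 1000/f milliseconds, so the highest sustainable rate for a given median latency is simply its reciprocal.

```python
# Back-of-the-envelope latency-to-rate conversion (figures from the summary,
# not independently measured).
def max_control_rate_hz(p50_latency_ms: float) -> float:
    # A loop at f Hz must finish each decision within 1000/f ms, so the
    # fastest sustainable rate for a given median latency is 1000 / latency.
    return 1000.0 / p50_latency_ms

# 61.2 ms P50 supports a bit over 16 Hz, consistent with the reported figure,
# and comfortably clears the ~100 ms budget of a 10 Hz controller.
print(round(max_control_rate_hz(61.2), 1))  # → 16.3
```

Note this treats the P50 as the whole budget; tail latency (P95/P99) would tighten the achievable rate in practice.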

Key Points
  • Achieves 3-6x end-to-end speedup (up to 9.6x) with only +8.2% parallelization overhead
  • Enables 16 Hz real-time control at 4B model scale with 61.2ms P50 latency on consumer GPUs
  • ST-Qwen-0.5B outperforms Google's FunctionGemma in accuracy and latency consistency on Mobile Actions benchmark

Why It Matters

Makes AI agents practical for real-time applications like robotics, gaming, and interactive avatars by solving the latency bottleneck.