Hierarchical Policy Optimization for Simultaneous Translation of Unbounded Speech
New post-training method refines LLMs for real-time translation, improving both quality and latency.
Simultaneous speech translation (SST) demands generating translations from partial audio input, a task where large language models (LLMs) have recently shown promise but at high computational cost. Prior work reformulated SST as a multi-turn dialogue so the LLM's key-value (KV) cache can be reused across turns, slashing per-step overhead, but it relied on supervised fine-tuning (SFT) data that is scarce and, when synthetically generated, often of poor quality.
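To make the KV-reuse idea concrete, here is a minimal sketch of the multi-turn framing with a Hugging Face-style causal LM. The checkpoint name, plain-text turn format, and greedy decoding loop are illustrative assumptions, not the paper's implementation:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder checkpoint, not the paper's model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

past = None  # KV cache carried across all dialogue turns

@torch.no_grad()
def translate_turn(new_source_segment: str, max_new: int = 64) -> str:
    """Feed only the newly arrived source segment as a fresh dialogue turn.
    past_key_values already encodes every earlier turn, so per-step cost
    scales with the new tokens, not the whole conversation history."""
    global past
    ids = tok(f"\nUser: {new_source_segment}\nAssistant:",
              return_tensors="pt").input_ids
    generated = []
    for _ in range(max_new):
        out = model(input_ids=ids, past_key_values=past, use_cache=True)
        past = out.past_key_values  # grow the shared cache across turns
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        if next_id.item() == tok.eos_token_id:
            break
        generated.append(next_id.item())
        ids = next_id  # subsequent steps process only one new token
    return tok.decode(generated)

# Each incoming audio chunk (transcribed or encoded upstream) becomes one turn:
print(translate_turn("Translate to German: The meeting starts"))
print(translate_turn("Translate to German: in five minutes."))
```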
To address this, researchers from NVIDIA and UC Santa Barbara introduce Hierarchical Policy Optimization (HPO), a post-training method that refines models initially trained on imperfect SFT data. HPO employs a hierarchical reward function that jointly optimizes translation quality (measured via COMET and MetricX) and latency, yielding gains of over 7 COMET points and 1.25 MetricX points at just 1.5 seconds of latency for English to Chinese, German, and Japanese. Comprehensive ablations examine different quality rewards and segmentation strategies, and the code is released on GitHub. The work was accepted as an oral presentation at ACL 2026.
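The hierarchical reward is easy to picture. The following is a plausible shape under stated assumptions (a quality score normalized to [0, 1], a quality floor that gates the latency term, and hand-picked weights), not HPO's published formula:

```python
def hierarchical_reward(quality: float, latency_s: float,
                        latency_budget: float = 1.5,
                        quality_floor: float = 0.6,
                        latency_weight: float = 0.5) -> float:
    """Two-level reward: quality (e.g., a COMET-style score scaled to [0, 1])
    is primary; latency is only credited once quality clears the floor, so
    the policy cannot game the reward by emitting fast but poor translations."""
    if quality < quality_floor:
        return quality  # level 1: below the floor, only quality counts
    # level 2: within the latency budget, earlier output earns a bonus
    latency_bonus = max(0.0, 1.0 - latency_s / latency_budget)
    return quality + latency_weight * latency_bonus
```

Gating the latency term this way is one natural reading of "hierarchical": the lower-level objective must be satisfied before the quality-latency trade-off kicks in.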
- HPO post-trains LLMs that were first fine-tuned on imperfect SFT data, using a hierarchical reward that balances translation quality and latency.
- Achieves improvements of over 7 COMET points and 1.25 MetricX points at 1.5 seconds of latency for English to Chinese, German, and Japanese.
- Enables full KV cache reuse for efficient LLM-based SST, reducing computational overhead significantly.
Why It Matters
HPO makes high-quality, low-latency real-time speech translation with LLMs practical for applications such as live captioning and international meetings.