Hierarchical Policy Optimization for Simultaneous Translation of Unbounded Speech
New post-training method refines LLMs for real-time translation, improving both quality and latency.
Simultaneous speech translation (SST) demands generating translations from partial audio input, a task where large language models (LLMs) have recently shown promise but at high computational cost. Prior work reformulated SST as a multi-turn dialogue so the LLM's key-value (KV) cache can be reused across turns, slashing per-step overhead, but it relied on supervised fine-tuning (SFT) data that is scarce and, when synthetically generated, often of poor quality.
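To make the KV-reuse idea concrete, here is a minimal sketch of the multi-turn framing with a Hugging Face-style causal LM. The checkpoint name, plain-text turn format, and greedy decoding loop are illustrative assumptions, not the paper's implementation:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder checkpoint, not the paper's model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

past = None  # KV cache carried across all dialogue turns

@torch.no_grad()
def translate_turn(new_source_segment: str, max_new: int = 64) -> str:
    """Feed only the newly arrived source segment as a fresh dialogue turn.
    past_key_values already encodes every earlier turn, so per-step cost
    scales with the new tokens, not the whole conversation history."""
    global past
    ids = tok(f"\nUser: {new_source_segment}\nAssistant:",
              return_tensors="pt").input_ids
    generated = []
    for _ in range(max_new):
        out = model(input_ids=ids, past_key_values=past, use_cache=True)
        past = out.past_key_values  # grow the shared cache across turns
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        if next_id.item() == tok.eos_token_id:
            break
        generated.append(next_id.item())
        ids = next_id  # subsequent steps process only one new token
    return tok.decode(generated)

# Each incoming audio chunk (transcribed or encoded upstream) becomes one turn:
print(translate_turn("Translate to German: The meeting starts"))
print(translate_turn("Translate to German: in five minutes."))
```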
To address this, researchers from NVIDIA and UC Santa Barbara introduce Hierarchical Policy Optimization (HPO), a post-training method that refines models initially trained on imperfect SFT data. HPO employs a hierarchical reward function that jointly optimizes translation quality (measured via COMET and MetricX) and latency, yielding gains of over 7 COMET points and 1.25 MetricX points at just 1.5 seconds of latency for English to Chinese, German, and Japanese. Comprehensive ablations examine different quality rewards and segmentation strategies, and the code is released on GitHub. The work was accepted as an oral presentation at ACL 2026.
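The hierarchical reward is easy to picture. The following is a plausible shape under stated assumptions (a quality score normalized to [0, 1], a quality floor that gates the latency term, and hand-picked weights), not HPO's published formula:

```python
def hierarchical_reward(quality: float, latency_s: float,
                        latency_budget: float = 1.5,
                        quality_floor: float = 0.6,
                        latency_weight: float = 0.5) -> float:
    """Two-level reward: quality (e.g., a COMET-style score scaled to [0, 1])
    is primary; latency is only credited once quality clears the floor, so
    the policy cannot game the reward by emitting fast but poor translations."""
    if quality < quality_floor:
        return quality  # level 1: below the floor, only quality counts
    # level 2: within the latency budget, earlier output earns a bonus
    latency_bonus = max(0.0, 1.0 - latency_s / latency_budget)
    return quality + latency_weight * latency_bonus
```

Gating the latency term this way is one natural reading of "hierarchical": the lower-level objective must be satisfied before the quality-latency trade-off kicks in.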
- HPO post-trains LLMs that were first fine-tuned on imperfect SFT data, using a hierarchical reward that balances translation quality and latency.
- Achieves improvements of over 7 COMET points and 1.25 MetricX points at 1.5 seconds of latency for English to Chinese, German, and Japanese.
- Enables full KV cache reuse for efficient LLM-based SST, reducing computational overhead significantly.
Why It Matters
HPO makes high-quality, low-latency real-time speech translation with LLMs practical for applications such as live captioning and international meetings.