ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models
New framework reduces duplicate phrases by over 50% while optimizing natural turn-taking in voice AI.
A research team from National Taiwan University, led by Chi-Yuan Hsiao and Hung-yi Lee, has introduced ASPIRin (Action Space Projection for Interactivity-Optimized Reinforcement Learning), a framework that targets a critical problem in conversational AI. Current end-to-end full-duplex Speech Language Models (SLMs) trained with standard reinforcement learning (RL) on raw tokens often suffer from 'generative collapse': optimizing for temporal dynamics such as turn-taking degrades semantic quality, producing robotic repetition and incoherent speech. ASPIRin's key innovation is to explicitly decouple the decision of when to speak from what to say.
ASPIRin projects the model's full text vocabulary onto a simplified, binary action space of 'active speech' versus 'inactive silence.' This Action Space Projection enables a separate optimization process for timing: the framework applies Group Relative Policy Optimization (GRPO) with carefully designed rule-based rewards to balance key interaction metrics such as user-interruption handling and response latency. Empirical results show that ASPIRin successfully optimizes interactivity across turn-taking, backchanneling (e.g., "mm-hmm"), and pause handling. Most importantly, by isolating timing control from content generation, it preserves semantic coherence and cuts the proportion of duplicate n-grams by more than half compared to standard GRPO, effectively eliminating the degenerative repetition that plagues current voice AI.
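To make the projection idea concrete, here is a minimal sketch of collapsing a per-step distribution over the text vocabulary into a binary 'speak'/'silence' action distribution. The dedicated silence-token id and the softmax-based collapse are illustrative assumptions, not the paper's exact implementation:

```python
import math

# Assumption: the SLM reserves one or more vocabulary ids for silence
# (e.g. a <silence>/<pad> token); here token id 0 plays that role.
SILENCE_TOKENS = {0}

def project_action_probs(token_logits: list[float]) -> tuple[float, float]:
    """Collapse raw vocabulary logits into (p_speak, p_silence) via softmax.

    Probability mass on silence tokens becomes the 'inactive silence'
    action; all remaining mass becomes 'active speech'.
    """
    m = max(token_logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in token_logits]
    total = sum(exps)
    p_silence = sum(exps[i] for i in SILENCE_TOKENS) / total
    return 1.0 - p_silence, p_silence

# A toy 4-token vocabulary where the silence token dominates:
p_speak, p_silence = project_action_probs([2.0, 0.5, 0.1, -1.0])
```

Because the timing policy is optimized over this two-action space rather than over raw tokens, the RL reward never directly pushes on word choice, which is the decoupling the paper credits for avoiding generative collapse.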
- Decouples timing from content: Uses Action Space Projection to separate 'when to speak' (binary state) from 'what to say' (text generation).
- Cuts repetition by 50%+: Reduces duplicate phrases (n-grams) by over half compared to standard reinforcement learning, preventing generative collapse.
- Optimizes natural interaction: Applies GRPO with rule-based rewards to improve turn-taking, backchanneling, and pause handling in full-duplex speech models.
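As a rough illustration of the rule-based rewards mentioned above, the sketch below scores a turn for interruption handling and response latency. The thresholds, reward shape, and function name are hypothetical; the paper's actual reward rules may differ:

```python
from typing import Optional

def interaction_reward(yield_delay_ms: Optional[float],
                       response_latency_ms: float,
                       max_latency_ms: float = 2000.0) -> float:
    """Illustrative rule-based interaction reward (NOT the paper's rules).

    yield_delay_ms: how long the model kept talking after a user
        interruption, or None if the user did not interrupt.
    response_latency_ms: gap between the user finishing and the model
        starting to speak.
    """
    reward = 0.0
    if yield_delay_ms is not None:  # the user barged in mid-utterance
        # Reward yielding quickly; penalize talking over the user.
        reward += 1.0 if yield_delay_ms < 500.0 else -1.0
    # Linearly reward low response latency, floored at zero.
    reward += max(0.0, 1.0 - response_latency_ms / max_latency_ms)
    return reward
```

In a GRPO-style setup, a group of sampled rollouts would be scored with rules like these and each rollout's advantage computed relative to the group mean, so no learned reward model is needed.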
Why It Matters
Enables more natural, less robotic voice assistants and AI companions by fixing awkward pauses and repetitive speech patterns.