Borderless Long Speech Synthesis
New framework uses Chain-of-Thought reasoning and a hierarchical annotation schema to generate multi-speaker, emotionally dynamic audio.
A research team of 15 authors, led by Xingchen Song, has published a paper introducing the Borderless Long Speech Synthesis framework. The approach tackles a core limitation of current text-to-speech (TTS) systems: they typically synthesize speech sentence by sentence, or from plain text alone, with no understanding of the global context. The framework is designed as a unified capability set for agent-centric long-audio synthesis, spanning tasks such as VoiceDesigner, multi-speaker synthesis, and long-form text synthesis.
On the data side, the team proposes a 'Labeling over filtering' strategy built on a novel 'Global-Sentence-Token' annotation schema: data is annotated top-down at the global, sentence, and token levels, giving the model rich contextual information rather than plain text alone. On the model side, they use a backbone with a continuous tokenizer and enhance it with Chain-of-Thought (CoT) reasoning and Dimension Dropout, which together significantly improve instruction following on complex audio generation tasks.
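The paper's concrete schema is not reproduced in this summary, but the 'Global-Sentence-Token' naming suggests a three-level nesting. Here is a minimal Python sketch of what such an annotation might look like; all field names (scene, emotion, stress, and so on) are illustrative assumptions, not the authors' published format:

```python
from dataclasses import dataclass, field

# Hypothetical three-level annotation mirroring the paper's
# 'Global-Sentence-Token' naming. Field names are illustrative
# assumptions, not the authors' actual schema.

@dataclass
class TokenLabel:
    text: str               # word or phoneme unit
    duration_ms: int        # hypothetical phonetic-level detail
    stress: bool = False

@dataclass
class SentenceLabel:
    text: str
    speaker: str            # which character is speaking
    emotion: str            # e.g. "calm", "angry"
    tokens: list[TokenLabel] = field(default_factory=list)

@dataclass
class GlobalLabel:
    scene: str              # high-level scene description
    speakers: list[str]     # cast of the audio scene
    style: str              # e.g. "audiobook", "dialogue"
    sentences: list[SentenceLabel] = field(default_factory=list)

# Example: a two-speaker exchange annotated top-down.
clip = GlobalLabel(
    scene="Heated negotiation in a small office",
    speakers=["Alice", "Bob"],
    style="dialogue",
    sentences=[
        SentenceLabel(
            text="We had a deal.",
            speaker="Alice",
            emotion="angry",
            tokens=[TokenLabel("deal", duration_ms=320, stress=True)],
        ),
    ],
)
```

The top-down nesting is the point: sentence- and token-level labels sit inside the global scene context instead of being produced in isolation, which is what supplies the contextual information plain-text training data lacks.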
The core innovation is the system's 'Native Agentic' design. The hierarchical annotation schema doubles as a 'Structured Semantic Interface' between a front-end LLM Agent and the synthesis engine. This creates a layered control protocol that spans from high-level scene semantics down to phonetic details. Effectively, text becomes a wide-band control channel, allowing an LLM to convert inputs from any modality—text, code, or other data—into structured commands for generating intricate audio scenes with multiple speakers, interruptions, and emotional progression.
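No wire format for this interface is given in the summary above, so the following is purely a hedged sketch of what a structured command from the front-end LLM Agent to the synthesis engine might look like; every key, value, and the `synthesize` hand-off mentioned in the comments are assumptions, not the paper's API:

```python
import json

# Hypothetical structured command an LLM agent could emit over the
# 'Structured Semantic Interface'. Keys and values are illustrative
# assumptions; the paper's concrete format is not published here.
command = {
    "global": {
        "scene": "Podcast intro, two hosts, light banter",
        "speakers": ["host_a", "host_b"],
    },
    "sentences": [
        {"speaker": "host_a", "emotion": "excited",
         "text": "Welcome back to the show!"},
        {"speaker": "host_b", "emotion": "amused",
         "text": "We have a lot to cover today.",
         # Token-level override: hypothetical phonetic control.
         "tokens": [{"text": "lot", "stress": True}]},
    ],
}

# The agent would hand this to the synthesis engine; a function like
# `synthesize(payload)` is a placeholder name, not an API from the paper.
payload = json.dumps(command, indent=2)
print(payload)
```

Note how the same global/sentence/token layering from the annotation schema reappears here as the control protocol, which is what lets one schema serve as both training labels and an agent-facing interface.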
- Uses a novel 'Global-Sentence-Token' hierarchical annotation schema for rich contextual data, moving beyond simple text filtering.
- Integrates Chain-of-Thought reasoning and Dimension Dropout into the model architecture to handle complex, multi-condition generation instructions (a speculative sketch of the latter follows this list).
- Designed as a 'Native Agentic' system in which the annotation schema acts as a Structured Semantic Interface for LLMs, enabling any-modality input to drive complex audio synthesis.
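The mechanism of Dimension Dropout is not detailed above. One plausible reading, sketched below purely as an assumption, is that entire conditioning dimensions (say, speaker, emotion, and style embeddings) are randomly dropped during training so the model learns to follow any subset of instructions. The function name, tensor shapes, and drop rate are all hypothetical:

```python
import torch

def dimension_dropout(cond: torch.Tensor, p: float = 0.2,
                      training: bool = True) -> torch.Tensor:
    """Randomly zero out whole conditioning dimensions.

    Speculative sketch of 'Dimension Dropout': `cond` is a batch of
    stacked condition embeddings of shape (batch, n_conditions, dim),
    e.g. one row each for speaker, emotion, and style. Dropping entire
    condition rows (not single activations, as in standard dropout)
    would force the model to handle any subset of instructions.
    Whether the authors rescale by 1/(1-p) is not stated; rescaling
    is omitted here.
    """
    if not training or p == 0.0:
        return cond
    # One keep/drop decision per condition row, shared across `dim`.
    keep = (torch.rand(cond.shape[0], cond.shape[1], 1,
                       device=cond.device) > p).to(cond.dtype)
    return cond * keep

# Example: batch of 4 items, 3 condition rows (speaker/emotion/style).
cond = torch.randn(4, 3, 256)
out = dimension_dropout(cond, p=0.3)
```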
Why It Matters
The framework bridges the gap between LLM reasoning and high-fidelity audio generation, enabling AI agents to create dynamic, multi-character audio content for media, gaming, and interactive applications.