YingMusic-Singer: Controllable Singing Voice Synthesis with Flexible Lyric Manipulation and Annotation-free Melody Guidance
A new diffusion model sings user-edited lyrics to the original melody without manual alignment, outperforming the Vevo2 baseline.
A research team from multiple institutions, led by Chunbo Hao, has published a new arXiv paper detailing YingMusic-Singer, a fully diffusion-based AI model for controllable singing voice synthesis. Its core innovation is flexible lyric manipulation that preserves the original melody, with no laborious manual alignment between the new words and the music. The model takes three key inputs: an optional timbre reference that sets the voice style, a singing clip that supplies the melody, and the user's modified lyrics.
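To make that interface concrete, here is a minimal Python sketch of what calling such a three-input model could look like. Everything here is an assumption for illustration: the class and function names (SingerInputs, synthesize) and field names are hypothetical stand-ins, not the released API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SingerInputs:
    """The three inputs the article describes (names are hypothetical)."""
    melody_clip: str                   # singing clip that supplies the melody
    new_lyrics: str                    # user-modified lyrics to sing over it
    timbre_ref: Optional[str] = None   # optional clip that sets the voice style

def synthesize(inputs: SingerInputs) -> bytes:
    """Placeholder for the diffusion sampler. In the real model, the melody
    clip conditions pitch and rhythm, the optional timbre reference conditions
    voice identity, and the new lyrics are sung without any manual alignment."""
    raise NotImplementedError("stand-in for the actual YingMusic-Singer model")

request = SingerInputs(
    melody_clip="original_verse.wav",
    new_lyrics="Rewritten words sung to the same tune",
    timbre_ref="target_singer.wav",
)
```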
The technical backbone of YingMusic-Singer is a training recipe that combines curriculum learning with Group Relative Policy Optimization (GRPO). This approach yields stronger melody preservation and lyric adherence than its closest baseline, Vevo2, which also supports melody control without manual alignment. To evaluate the new capability, the team also introduced LyricEditBench, the first dedicated benchmark for assessing melody-preserving lyric modification. The researchers have made the model's code, weights, benchmark, and demonstration samples publicly available, facilitating further research and application in AI-powered music creation.
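For readers unfamiliar with GRPO, the core idea is to score a group of candidate outputs for the same input and use each candidate's reward relative to its own group, rather than a learned value critic, as its advantage. The sketch below shows only that group-relative computation; the reward values and the singing-specific scoring criteria are illustrative assumptions, not the paper's actual reward model.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Group-relative advantage: each sample is judged against the mean and
    standard deviation of its own group, so no value network is needed."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Hypothetical rewards for four vocals rendered from the same melody and
# edited lyrics, e.g. scored on melody preservation and lyric adherence.
rewards = np.array([0.82, 0.64, 0.91, 0.55])
print(grpo_advantages(rewards))  # above-average candidates get positive advantage
```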
- Fully diffusion-based model for singing voice synthesis with flexible lyric editing.
- Requires no manual alignment; uses a melody clip, timbre reference, and new lyrics as input.
- Outperforms the Vevo2 baseline and introduces the first evaluation benchmark, LyricEditBench.
Why It Matters
This technology could revolutionize music production, remixing, and content creation by making professional-grade vocal editing accessible and efficient.