VITA-QinYu: First AI model that role-plays and sings with human-like expression

New open-source SLM handles role-playing and singing, beating peer models by 7 percentage points on role-playing benchmarks.

Deep Dive

Researchers have unveiled VITA-QinYu, the first end-to-end spoken language model (SLM) that goes beyond natural conversation to handle both role-playing and singing generation. Unlike previous models that treat speech as a single stream, VITA-QinYu uses a hybrid speech-text paradigm that interleaves text and audio modeling and represents audio with multi-codebook tokens. This design enriches paralinguistic representation (tone, mood, and performance nuances) while keeping the modalities separate to avoid interference. To train the model, the team built a comprehensive data generation pipeline and synthesized 15.8K hours of natural conversation, role-playing, and singing data.
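As a rough illustration of what interleaving text with multi-codebook audio tokens can look like, the sketch below alternates short chunks of text tokens with chunks of audio frames, where each frame carries one token per codebook. It is not the authors' implementation: the number of codebooks, the chunk sizes, and the names `Step` and `interleave` are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

# All three constants are assumptions for illustration; the source does not state them.
NUM_CODEBOOKS = 4   # audio tokens per frame, one per codebook
TEXT_CHUNK = 2      # text tokens emitted before switching to audio
AUDIO_CHUNK = 3     # audio frames emitted before switching back to text


@dataclass(frozen=True)
class Step:
    """One position in the interleaved sequence (hypothetical structure)."""
    text: Optional[int]               # text token id, or None on audio steps
    audio: Optional[Tuple[int, ...]]  # one id per codebook, or None on text steps


def interleave(
    text_ids: Sequence[int],
    audio_frames: Sequence[Sequence[int]],
    text_chunk: int = TEXT_CHUNK,
    audio_chunk: int = AUDIO_CHUNK,
) -> List[Step]:
    """Alternate chunks of text tokens and audio frames until both are exhausted."""
    steps: List[Step] = []
    t = a = 0
    while t < len(text_ids) or a < len(audio_frames):
        # Text chunk: each step carries only a text token.
        for tok in text_ids[t:t + text_chunk]:
            steps.append(Step(text=tok, audio=None))
        t += text_chunk
        # Audio chunk: each step carries only a multi-codebook frame.
        for frame in audio_frames[a:a + audio_chunk]:
            if len(frame) != NUM_CODEBOOKS:
                raise ValueError("each audio frame needs one token per codebook")
            steps.append(Step(text=None, audio=tuple(frame)))
        a += audio_chunk
    return steps
```

In this sketch every step holds either a text token or an audio frame, never both, which is one simple way to picture keeping the two modalities separate within a single sequence.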

In evaluations, VITA-QinYu demonstrates superior expressiveness: it outperforms competing SLMs by 7 percentage points on role-playing tasks and scores 0.13 points higher on a 5-point Mean Opinion Score (MOS), a listener-rated measure, for singing quality. It also sets new state-of-the-art results on conversational accuracy and fluency, beating prior SLMs by 1.38 percentage points on the C3 benchmark and 4.98 percentage points on URO. The model is fully open-source, with code, weights, and a demo that supports streaming and full-duplex interaction, making it a significant milestone in expressive AI speech.

Key Points
  • First end-to-end SLM designed for both role-playing and singing, not just natural conversation.
  • Trained on 15.8K hours of synthetic data using a hybrid speech-text paradigm with multi-codebook tokens.
  • Outperforms peers by 7pp on role-playing and by 0.13 MOS points on singing, and achieves SOTA on the C3 (+1.38pp) and URO (+4.98pp) benchmarks.

Why It Matters

Opens the door to truly expressive AI voices for virtual assistants, gaming, and entertainment.