
Scenema Audio’s zero-shot voice cloning lets any voice perform any emotion

Diffusion-based TTS controls emotion independently of voice identity and is described as more natural-sounding than Gemini 3.1 Flash TTS.

Deep Dive

Scenema Audio approaches voice cloning by treating emotional performance and voice identity as independent attributes. Users provide a reference audio clip for the desired voice (the 'who') and a text prompt describing the emotional delivery (the 'how'): rage, grief, excitement, or even a child's wonder. Because the cloning is zero-shot, any voice can express any emotion, even one that was never recorded in the reference. The model is diffusion-based rather than a traditional TTS pipeline, which gives it a more natural, less robotic sound than autoregressive systems such as Gemini 3.1 Flash TTS. The trade-off is reliability: some seeds produce repetition or gibberish, so users need a post-editing workflow in which they generate multiple takes and select the best one.
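The multi-take workflow can be partly automated by screening out takes that visibly loop before a human picks the best one. The sketch below is an illustration only, not part of Scenema Audio: `generate_and_transcribe` is a hypothetical callable standing in for "synthesize one take with this seed, then transcribe it with whatever speech-to-text tool you use", and the back-to-back phrase check is a simple heuristic chosen here, not a method described by the team.

```python
from typing import Callable, Dict, List


def has_immediate_repeat(transcript: str, n: int = 3) -> bool:
    """Return True if some n-word phrase is immediately followed by itself."""
    words = transcript.lower().split()
    # Compare each n-word window with the n-word window right after it.
    for i in range(len(words) - 2 * n + 1):
        if words[i:i + n] == words[i + n:i + 2 * n]:
            return True
    return False


def collect_takes(
    generate_and_transcribe: Callable[[int], str],  # hypothetical: seed -> transcript
    seeds: List[int],
    n: int = 3,
) -> Dict[int, str]:
    """Run one take per seed and keep only takes with no looped phrase."""
    kept: Dict[int, str] = {}
    for seed in seeds:
        transcript = generate_and_transcribe(seed)
        if not has_immediate_repeat(transcript, n):
            kept[seed] = transcript
    return kept
```

A filter like this only catches looping; gibberish and weak performances would still need a listening pass, so it narrows the candidate takes rather than replacing manual selection.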

Beyond standalone speech generation, Scenema Audio is designed for integration into video production. The team demonstrates an audio-first workflow: generate the voice performance, then feed it into an A2V (audio-to-video) pipeline using models like LTX 2.3, Wan 2.6, or Seedance 2.0 to create matching video content. Prompting is critical: generic descriptions yield generic outputs, while theatrical, action-tagged prompts produce compelling performances. A pace parameter controls the speaking rate in words per second, and complex words benefit from phonetic spelling (e.g., 'Tchaikovsky' as 'Chai-koff-skee'). The model is packaged as a Docker container with a REST API and automatic VRAM management, supporting GPUs from 16 GB of VRAM (using INT8 quantization) up to higher-end cards, which makes it accessible for production use.
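To make the prompting advice concrete, here is a minimal sketch of preparing a request: it swaps hard words for phonetic spellings, estimates clip length from the pace (words divided by words per second), and serializes a JSON body. The field names (`text`, `emotion_prompt`, `pace`, `seed`) are assumptions made for illustration; the actual REST schema and endpoint are not given in the source, so nothing is sent over the network here.

```python
import json
import re
from typing import Dict


def apply_phonetic_spellings(text: str, spellings: Dict[str, str]) -> str:
    """Replace whole-word matches (case-insensitive) with phonetic spellings."""
    for word, phonetic in spellings.items():
        pattern = rf"\b{re.escape(word)}\b"
        # A function replacement avoids backslash processing in the phonetic text.
        text = re.sub(pattern, lambda _m, p=phonetic: p, text, flags=re.IGNORECASE)
    return text


def estimate_duration_seconds(text: str, pace_wps: float) -> float:
    """Rough clip length: word count divided by pace in words per second."""
    if pace_wps <= 0:
        raise ValueError("pace must be positive")
    return len(text.split()) / pace_wps


def build_request_body(text: str, emotion_prompt: str, pace_wps: float, seed: int) -> str:
    """Serialize a request body. All field names here are hypothetical."""
    return json.dumps(
        {"text": text, "emotion_prompt": emotion_prompt, "pace": pace_wps, "seed": seed}
    )


script = apply_phonetic_spellings(
    "Tonight we play Tchaikovsky for you",
    {"Tchaikovsky": "Chai-koff-skee"},
)
body = build_request_body(script, "hushed, reverent, [leans in]", pace_wps=3.0, seed=7)
seconds = estimate_duration_seconds(script, 3.0)  # 6 words at 3 words/second = 2.0
```

Keeping the phonetic map separate from the script makes it easy to reuse the same substitutions across takes and seeds.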

Key Points
  • Emotional performance and voice identity are independent; any voice can perform any emotion with zero-shot cloning.
  • Diffusion-based output sounds more natural than autoregressive TTS (e.g., Gemini 3.1 Flash TTS), especially for emotional delivery.
  • Ships as a Docker REST API with automatic VRAM management (16 GB INT8, 24 GB, etc.), supporting post-editing workflows and A2V pipeline integration.

Why It Matters

Enables natural, emotionally controlled voice cloning for video production without pre-recording every emotion.