Scenema Audio enables any voice to perform any emotion via diffusion
Separates 'who' from 'how': zero-shot emotional voice cloning for any speaker.
Scenema.ai has open-sourced Scenema Audio, a diffusion-based speech generation model for zero-shot expressive voice cloning. Unlike traditional TTS systems, which tie a vocal identity to a fixed emotional range, Scenema Audio treats voice identity and emotional performance as independent dimensions. You provide a short reference audio clip to establish the 'who' (the speaker's voice) and a descriptive text prompt for the 'how' (e.g., 'whispering nervously' or 'shouting with joy'). This allows any voice to convincingly deliver any emotion, even if the speaker has never been recorded in that state.
The model is sensitive to prompt engineering: generic descriptions yield generic output, while theatrical prompts with action tags produce vivid performances. A pace parameter controls speaking rate. Output quality varies by seed, and repetition or gibberish can occur, so the recommended workflow is to generate multiple takes, pick the best one, and trim it. Despite these quirks, the team prefers Scenema Audio over more controllable alternatives such as Gemini 3.1 Flash TTS because diffusion-generated speech sounds significantly more natural and less robotic, especially in emotional delivery.
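The multi-take workflow described above can be sketched as one request payload per seed. This is a minimal illustration only: the function name and every field name (`text`, `prompt`, `reference_audio`, `pace`, `seed`) are assumptions made for the example, not the documented Scenema Audio API, and no network call is made.

```python
import json
import random


def build_take_requests(text, style_prompt, reference_path,
                        num_takes=4, pace=1.0, base_seed=None):
    """Return one request payload per take, each with a distinct seed.

    All field names are hypothetical; consult the project's API
    documentation for the real request schema.
    """
    rng = random.Random(base_seed)
    # Distinct seeds, reproducible when base_seed is fixed.
    seeds = rng.sample(range(2**31), num_takes)
    return [
        {
            "text": text,
            "prompt": style_prompt,             # the 'how'
            "reference_audio": reference_path,  # the 'who'
            "pace": pace,
            "seed": seed,
        }
        for seed in seeds
    ]


takes = build_take_requests(
    text="I can't believe we made it.",
    style_prompt="whispering nervously",
    reference_path="speaker.wav",
    num_takes=4,
    base_seed=0,
)
for payload in takes:
    print(json.dumps(payload))
```

Each payload would be sent as a separate generation request; the takes are then auditioned by ear, and the best one is kept and trimmed.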
The model is distributed as a Docker container with a REST API that mirrors scenema.ai's production setup. It detects available VRAM automatically: on a 16 GB GPU it loads the INT8 quantized model (4.9 GB) and streams audio to the CPU, which requires 32 GB of system RAM. The diffusion pass itself is fast (the team reduced denoising steps from 50 to 8 without quality loss), but total generation time is dominated by other pipeline components. Complex proper nouns (e.g., 'Tchaikovsky') benefit from phonetic spelling because the model lacks a pronunciation dictionary. Scenema Audio also supports an audio-first video generation workflow: generate the speech first, then feed it into an A2V pipeline (LTX 2.3, Wan 2.6, Seedance 2.0) to create video that matches the performance. This approach yields more coherent lip-sync and emotional alignment than generating video directly.
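Because the model has no pronunciation dictionary, the phonetic-spelling tip for proper nouns can be applied as a small preprocessing step before the text is sent for generation. The sketch below is an assumption-laden illustration: the `RESPELLINGS` table, the `respell` helper, and the particular respelling of 'Tchaikovsky' are hypothetical and should be tuned by ear.

```python
import re

# Hypothetical respellings; adjust by ear for the voice and language in use.
RESPELLINGS = {
    "Tchaikovsky": "Chy-KOFF-skee",
}


def respell(text, respellings=RESPELLINGS):
    """Replace whole-word matches of known hard names with phonetic spellings."""
    if not respellings:
        return text
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in respellings) + r")\b"
    )
    return pattern.sub(lambda m: respellings[m.group(1)], text)


print(respell("Tchaikovsky wrote Swan Lake."))
# prints: Chy-KOFF-skee wrote Swan Lake.
```

Keeping the respellings in one table means the original script stays readable and the substitutions can be revised without touching the source text.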
- Zero-shot voice cloning: no training needed; any voice can perform any emotion by combining reference audio and a text prompt.
- Diffusion-based generation sounds more natural for emotional speech than more controllable alternatives (e.g., Gemini 3.1 Flash TTS), but usually needs several seeds to get a good take.
- Ships as a Docker container with a REST API and automatic VRAM detection; the INT8 model (4.9 GB) runs on a 16 GB GPU with CPU streaming and needs 32 GB of system RAM.
Why It Matters
Scenema Audio could change video production by enabling realistic emotional voiceovers for any character without needing voice actors.