Gelina: Unified Speech and Gesture Synthesis via Interleaved Token Prediction
A new unified model generates synchronized speech and co-speech gestures from text, improving on sequential methods.
A research team from KTH Royal Institute of Technology, Sorbonne Université, and other institutions has unveiled Gelina, a framework for synthesizing multimodal human communication. Traditional computational methods generate speech and co-speech gestures sequentially, which weakens the synchrony and prosodic alignment between the two. Gelina's key innovation is a unified architecture that represents speech and gestures as interleaved token sequences processed by a single discrete autoregressive backbone, paired with modality-specific decoders. This enables joint generation of both modalities from a text input, promising more natural and tightly coupled outputs.
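To make the interleaving idea concrete, here is a minimal Python sketch of how two discrete token streams might share one vocabulary so a single autoregressive backbone can predict both. The vocabulary sizes, the strict frame-by-frame alternation, and the helper names (`interleave`, `split`) are illustrative assumptions, not Gelina's actual tokenization scheme.

```python
# Minimal sketch of interleaved token prediction. The codec vocabulary
# sizes and one-to-one frame alignment below are illustrative assumptions.

SPEECH_VOCAB = 1024   # hypothetical size of the discrete speech codec vocabulary
GESTURE_VOCAB = 512   # hypothetical size of the discrete gesture codebook

def interleave(speech_tokens, gesture_tokens):
    """Merge two per-frame token streams into one sequence over a shared
    vocabulary: gesture IDs are offset past the speech range so a single
    autoregressive model can predict both modalities."""
    merged = []
    for s, g in zip(speech_tokens, gesture_tokens):
        merged.append(s)                   # speech token keeps its ID
        merged.append(SPEECH_VOCAB + g)    # gesture token shifted into its own range
    return merged

def split(merged):
    """Route generated tokens back to modality-specific decoders."""
    speech = [t for t in merged if t < SPEECH_VOCAB]
    gesture = [t - SPEECH_VOCAB for t in merged if t >= SPEECH_VOCAB]
    return speech, gesture

# Example: three aligned frames of (speech, gesture) tokens.
seq = interleave([17, 402, 993], [5, 88, 301])
print(seq)            # [17, 1029, 402, 1112, 993, 1325]
print(split(seq))     # ([17, 402, 993], [5, 88, 301])
```

Because both modalities live in one sequence, each predicted token is conditioned on the full interleaved history of speech and gesture, which is what allows the model to keep the two streams synchronized.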
Gelina offers several practical capabilities beyond basic synthesis. It supports multi-speaker and multi-style voice cloning, so generated speech can take on different voices and speaking styles. It can also perform gesture-only synthesis directly from a speech audio input, which could be useful for animating existing recorded dialogue. The paper, accepted at ICASSP 2026, reports that subjective listener evaluations and objective metrics show Gelina achieving competitive speech quality while significantly improving gesture quality over previous unimodal baseline systems.
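The gesture-only mode suggests the backbone can be prompted with tokenized speech and constrained to emit only gesture tokens. The sketch below illustrates one way such constrained decoding could look over the shared vocabulary from the previous sketch; the decoding loop, the greedy selection, and the stand-in model are assumptions for illustration, not Gelina's actual method.

```python
# Hypothetical gesture-only decoding: tokens from input speech audio form
# the prompt, and sampling is masked to the gesture ID range. This only
# illustrates constrained decoding over an interleaved vocabulary.
import random

SPEECH_VOCAB = 1024   # same illustrative sizes as the previous sketch
GESTURE_VOCAB = 512

def gesture_only_decode(speech_tokens, step_fn):
    """Generate one gesture token per given speech token, feeding the
    growing interleaved context back into the (stand-in) model."""
    context, gestures = [], []
    for s in speech_tokens:
        context.append(s)                 # condition on the provided speech token
        scores = step_fn(context)         # model scores over the shared vocabulary
        # Mask the speech range so only gesture IDs can be chosen.
        gesture_scores = scores[SPEECH_VOCAB:]
        g = max(range(GESTURE_VOCAB), key=gesture_scores.__getitem__)
        gestures.append(g)
        context.append(SPEECH_VOCAB + g)  # preserve the interleaved token order
    return gestures

# Stand-in "model": random scores for every token ID in the shared vocabulary.
fake_model = lambda ctx: [random.random() for _ in range(SPEECH_VOCAB + GESTURE_VOCAB)]
print(gesture_only_decode([17, 402, 993], fake_model))
```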
- Unified model architecture uses interleaved token prediction in a single backbone to jointly generate speech and gestures from text.
- Enables multi-speaker/style cloning and can synthesize gestures directly from speech audio inputs for animation.
- Achieves competitive speech quality and outperforms unimodal gesture baselines in evaluations, producing more natural, synchronized multimodal output.
Why It Matters
Joint speech-and-gesture synthesis enables more lifelike and expressive virtual avatars and digital assistants, along with new possibilities for entertainment and accessibility content.