Research & Papers

LLaMo: Scaling Pretrained Language Models for Unified Motion Understanding and Generation with Continuous Autoregressive Tokens

This new model can create smooth, real-time animations just from a text description.

Deep Dive

Researchers have introduced LLaMo, a model that unifies motion understanding and generation within a modified LLM architecture. Instead of quantizing motion into a discrete codebook, it encodes human motion in a continuous latent space, avoiding the jitter artifacts common in discrete tokenization methods. The model handles both text-to-motion generation and motion captioning, and it streams generated motion in real time at over 30 frames per second. This marks a significant step toward a general-purpose motion-language model.
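The core idea behind "continuous autoregressive tokens" can be pictured as a causal transformer whose output head regresses a real-valued latent vector for each frame, rather than picking an entry from a discrete codebook. The sketch below is purely illustrative and is not the authors' implementation: the class and parameter names (ContinuousMotionLM, latent_dim, the toy dimensions) are assumptions, and the paper's actual backbone, latent encoder, and training objective are not shown.

```python
# Illustrative sketch only (assumed names and sizes, not LLaMo's actual code):
# a causal transformer that autoregressively regresses continuous motion
# latents instead of sampling discrete tokens.
import torch
import torch.nn as nn

class ContinuousMotionLM(nn.Module):
    def __init__(self, hidden_dim=256, latent_dim=64, nhead=4, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=nhead, batch_first=True
        )
        self.backbone = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.in_proj = nn.Linear(latent_dim, hidden_dim)   # motion latent -> input embedding
        self.out_head = nn.Linear(hidden_dim, latent_dim)  # hidden state -> next motion latent

    def step(self, embeds):
        # Causal mask so each position attends only to earlier positions.
        mask = nn.Transformer.generate_square_subsequent_mask(embeds.size(1))
        hidden = self.backbone(embeds, mask=mask)
        # Regress the next continuous latent: no argmax, no codebook lookup.
        return self.out_head(hidden[:, -1:])

    @torch.no_grad()
    def generate(self, prompt_embeds, num_frames):
        # Streaming rollout: each predicted latent is embedded and fed back,
        # so a downstream motion decoder can render frames as they arrive.
        embeds, latents = prompt_embeds, []
        for _ in range(num_frames):
            z = self.step(embeds)                          # (1, 1, latent_dim)
            latents.append(z)
            embeds = torch.cat([embeds, self.in_proj(z)], dim=1)
        return torch.cat(latents, dim=1)                   # (1, num_frames, latent_dim)

model = ContinuousMotionLM()
text_prompt = torch.randn(1, 8, 256)  # stand-in for encoded text-prompt tokens
motion = model.generate(text_prompt, num_frames=30)
print(motion.shape)  # torch.Size([1, 30, 64])
```

Because each frame's latent comes out of a single forward step and is fed straight back in, frames can be decoded one at a time rather than after the full sequence completes, which is what makes streaming generation possible in this setup.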

Why It Matters

It enables high-fidelity, real-time animation from simple text prompts, with direct applications in gaming, robotics, and virtual reality.