Inner Speech as Behavior Guides: Steerable Imitation of Diverse Behaviors for Human-AI Coordination
New AI agents generate internal dialogue to mimic the diversity of human behavior, enabling real-time steering without retraining.
Harvard researchers have introduced MIMIC (Modeling Inner Motivations for Imitation and Control), a framework that enables AI agents to generate internal linguistic representations of behavioral intent. It is inspired by human cognition, in which inner speech guides action selection. Presented as a spotlight paper at NeurIPS 2025, MIMIC addresses two limitations of current imitation learning methods: they struggle to capture the non-Markovian nature and inherent diversity of human behavior, and they cannot be steered in real time. The framework represents a significant advancement toward creating AI collaborators that can adapt dynamically to human partners in complex coordination tasks.
The technical architecture employs vision-language models as linguistic scaffolding to train a conditional variational autoencoder that generates inner speech from observations, paired with a diffusion-based behavior cloning policy that selects actions conditioned on both current observations and the generated speech. This dual approach allows for fine-grained behavioral steering at inference time simply by conditioning the agent on behavior-specific natural language commands. Experiments across robotic manipulation and human-AI collaboration games demonstrated that MIMIC agents achieve significantly higher behavior diversity and fidelity to human demonstrations while enabling nuanced control without requiring additional training data. The team has open-sourced their code and pre-trained agents, potentially accelerating development in human-AI teaming applications from manufacturing to gaming.
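The inference-time data flow described above can be illustrated with a toy sketch. This is not the authors' implementation. Every name here (`embed_text`, `sample_inner_speech`, `denoise_action`, `act`) is hypothetical, the "weights" are hand-written constants rather than trained parameters, and the latent sampling and refinement loop are simplified stand-ins for a real conditional variational autoencoder and a real diffusion policy. The sketch shows only the control structure: inner speech is sampled from the observation by default, and a natural-language command, when supplied, takes its place.

```python
import hashlib
import random

DIM = 4  # size of the toy inner-speech embedding (assumption)


def embed_text(text, dim=DIM):
    """Hypothetical stand-in for a language encoder: deterministic pseudo-embedding."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Map the first `dim` bytes of the hash into the range [-1, 1].
    return [digest[i] / 255.0 * 2.0 - 1.0 for i in range(dim)]


def sample_inner_speech(obs, rng, dim=DIM):
    """CVAE-style sampling: draw a latent z, then decode it conditioned on obs."""
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    obs_mean = sum(obs) / len(obs)
    # Toy decoder: the observation shifts the latent sample.
    return [0.5 * zi + 0.5 * obs_mean for zi in z]


def denoise_action(obs, speech, rng, steps=20):
    """Diffusion-style loop: start from noise, refine toward a conditioned target."""
    # Toy denoiser: the target depends on both the observation and the speech.
    target = sum(obs) / len(obs) + sum(speech) / len(speech)
    action = rng.gauss(0.0, 1.0)
    for _ in range(steps):
        action += 0.5 * (target - action)  # each step halves the remaining error
    return action


def act(obs, rng, command=None):
    """Pick an action; a command, if given, replaces the sampled inner speech."""
    if command is not None:
        speech = embed_text(command)
    else:
        speech = sample_inner_speech(obs, rng)
    return denoise_action(obs, speech, rng)


# Usage: unsteered actions vary with the sampled inner speech, while a
# steered action is determined by the command embedding.
rng = random.Random(0)
obs = [0.2, 0.4, 0.6]
free_actions = [act(obs, rng) for _ in range(3)]
steered_action = act(obs, rng, command="pass the onion")
```

In this toy version, diversity comes entirely from the sampled latent, and steering works because the policy never needs to know whether its speech input was generated or supplied. No parameters change when a command is given, which mirrors the claim that steering requires no retraining.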
- Uses vision-language models to generate "inner speech" guiding AI decision-making, inspired by human cognitive theory
- Achieves 40% higher behavior diversity in robotic manipulation tasks compared to standard imitation learning
- Allows real-time steering of AI agents with natural language commands without retraining or additional demonstrations
Why It Matters
Enables more natural and adaptive AI teammates for robotics, gaming, and collaborative work environments where human-like flexibility is crucial.