MAVIC: New method lets multi-agent AI follow instructions mid-task
Researchers fix the value collapse that occurs in multi-agent RL when macro-actions are interrupted by natural language commands.
A new paper introduces MAVIC (Macro-Action Value Correction for Instruction Compliance), which addresses a critical failure mode in multi-agent reinforcement learning (MARL). When an external natural language instruction interrupts an agent's ongoing macro-action, standard Bellman updates couple value estimates across conflicting instruction contexts, producing inconsistent values. MAVIC modifies the bootstrapping target itself at instruction boundaries: it corrects for the incoming instruction and then restores the continuation value under the original objective. Unlike reward shaping, this approach adjusts the value estimation process directly, enabling consistent value learning under stochastic instruction switches within a single policy.
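To make the boundary correction concrete, here is a minimal Python sketch of one possible reading of the idea. It is not the paper's algorithm: the `Transition` fields, the `value(state, instruction)` callable, and both target functions are hypothetical names introduced for illustration, and the exact form of MAVIC's correction is an assumption. The sketch contrasts a naive macro-action TD target, which bootstraps from the value under the newly arrived instruction, with a target that bootstraps from the continuation value under the instruction the macro-action started with.

```python
from dataclasses import dataclass
from typing import Callable, Hashable

# Hypothetical container for one macro-action transition (not from the paper).
@dataclass(frozen=True)
class Transition:
    reward: float          # discounted reward accumulated during the macro-action
    next_state: Hashable   # state where the macro-action ended or was interrupted
    instr: Hashable        # instruction context the macro-action started under
    next_instr: Hashable   # instruction context active after the transition
    duration: int = 1      # number of primitive steps the macro-action ran
    done: bool = False     # True if the episode terminated

# Assumed critic interface: value(state, instruction) -> float.
ValueFn = Callable[[Hashable, Hashable], float]


def naive_target(t: Transition, value: ValueFn, gamma: float = 0.99) -> float:
    """Standard macro-action TD target.

    At an instruction boundary this bootstraps from the NEW context,
    which couples value estimates across conflicting instructions.
    """
    if t.done:
        return t.reward
    return t.reward + gamma ** t.duration * value(t.next_state, t.next_instr)


def boundary_corrected_target(t: Transition, value: ValueFn, gamma: float = 0.99) -> float:
    """Illustrative corrected target (an assumption, not the paper's formula).

    Bootstraps from the continuation value under the ORIGINAL context,
    so the estimate for `t.instr` is not pulled toward `t.next_instr`.
    When no instruction switch occurs, it equals the naive target.
    """
    if t.done:
        return t.reward
    return t.reward + gamma ** t.duration * value(t.next_state, t.instr)
```

With a toy critic where the interrupted state is worth 10.0 under the original objective and 0.0 under the incoming instruction, a reward of 1.0, a two-step macro-action, and `gamma = 0.5`, the naive target evaluates to 1.0 while the corrected target evaluates to 3.5. The gap between the two is the cross-context leakage that the article describes.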
The authors provide a theoretical analysis proving that MAVIC converges to the correct value functions, and they implement it as an actor-critic algorithm. Across increasingly complex cooperative multi-agent environments, MAVIC achieves high instruction compliance (following the given command) while preserving performance on the underlying long-horizon task. The work has practical implications for deploying multi-agent systems in dynamic real-world settings, such as warehouse robots or autonomous fleets, where human operators may need to issue real-time commands that temporarily override default behaviors.
- MAVIC corrects Bellman backups at instruction boundaries to prevent value estimation collapse when macro-actions are interrupted by external commands.
- Unlike reward shaping, MAVIC modifies the bootstrapping target itself, enabling consistent value estimation under stochastic instruction switching.
- The method achieves high instruction compliance while preserving base task performance in increasingly complex cooperative multi-agent environments.
Why It Matters
Enables reliable multi-agent AI that follows real-time human commands without sacrificing long-term objectives.