LLM-ALSO: LLMs optimize rewards for multi-agent RL in sparse environments
New framework uses LLM critics to fix coordination failures in sparse-reward multi-agent tasks.
LLM-ALSO tackles a core challenge in multi-agent reinforcement learning (MARL): designing effective reward signals when feedback is sparse. Existing methods often depend on hand-crafted rewards or domain expertise, neither of which scales. The proposed framework instead uses large language models (LLMs) to automate reward shaping through a three-step iterative cycle. First, a Critic LLM diagnoses stage-specific miscoordination from sparse returns and behavior summaries. Next, a Generator LLM proposes candidate reward-shaping rules. Finally, a short-horizon branch validation filters out unreliable modifications before they affect the main training trajectory.
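The cycle described above can be pictured as a single function that chains the three steps. The following is a minimal sketch, not the paper's implementation: `ShapingRule`, `shaping_cycle`, `min_gain`, and all the stub functions are hypothetical names introduced here, the LLM calls are replaced by plain Python callables, and the acceptance rule (keep the best candidate whose validation gain exceeds a threshold) is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ShapingRule:
    """Hypothetical container for one candidate reward-shaping rule."""
    name: str
    bonus: Callable[[dict], float]  # maps a transition summary to an extra reward


def shaping_cycle(
    critic: Callable[[float, str], str],
    generator: Callable[[str], List[ShapingRule]],
    validate: Callable[[ShapingRule], float],
    sparse_return: float,
    behavior_summary: str,
    min_gain: float = 0.0,
) -> Optional[ShapingRule]:
    """Run one Critic -> Generator -> validation pass; return the accepted rule, if any."""
    # Step 1: the critic turns sparse returns and a behavior summary into a diagnosis.
    diagnosis = critic(sparse_return, behavior_summary)
    # Step 2: the generator proposes candidate shaping rules for that diagnosis.
    candidates = generator(diagnosis)
    # Step 3: validation scores each candidate; weak or harmful ones are dropped.
    scored = [(validate(rule), rule) for rule in candidates]
    accepted = [(gain, rule) for gain, rule in scored if gain > min_gain]
    if not accepted:
        return None  # assumed behavior: keep training with the current reward
    return max(accepted, key=lambda pair: pair[0])[1]


# Stand-ins for the two LLMs and the validator (all assumptions for illustration).
def stub_critic(sparse_return: float, behavior_summary: str) -> str:
    return f"return={sparse_return:.2f}; diagnosis: {behavior_summary}"


def stub_generator(diagnosis: str) -> List[ShapingRule]:
    return [
        ShapingRule("approach_bonus", lambda t: 0.1 * t.get("progress", 0.0)),
        ShapingRule("idle_penalty", lambda t: -0.05 if t.get("idle", False) else 0.0),
    ]


def stub_validate(rule: ShapingRule) -> float:
    # Pretend short-horizon gains measured on the sparse metric.
    return {"approach_bonus": 0.3, "idle_penalty": -0.1}[rule.name]


chosen = shaping_cycle(stub_critic, stub_generator, stub_validate, 0.0, "agents scatter")
```

In this toy run, only the candidate with a positive validation gain survives. In a real system the three callables would wrap LLM prompts and short training runs.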
In experiments on sparse-reward cooperative tasks (e.g., multi-agent navigation and predator-prey), LLM-ALSO consistently outperformed baselines in both final performance and sample efficiency. It did so without offline data or human-engineered reward functions, relying instead on the LLM's ability to reason about emergent coordination failures. The paper provides 14 pages of analysis, including ablation studies on the validation branch and stage-aware adaptation. Although the LLM calls add some computational overhead, the gains in sparse settings suggest a promising direction for scalable MARL training.
- LLM-ALSO iterates through three steps: a Critic LLM diagnoses coordination failures, a Generator LLM proposes reward-shaping rules, and branch validation filters out unreliable updates.
- Outperforms baselines in final performance and sample efficiency on sparse-reward cooperative MARL benchmarks such as multi-agent predator-prey, without manual reward engineering.
- Reduces the risk of harmful LLM-generated modifications by validating them over a short horizon before they are integrated into the main training trajectory.
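The short-horizon validation idea can be illustrated by training on copies of the current policy so that the main trajectory is never touched. This is a hedged sketch only: `branch_validate`, `horizon`, `margin`, and the toy training and evaluation functions are hypothetical. Comparing the shaped branch against an unshaped control branch is an assumption about how "unreliable" might be decided, not a detail confirmed by the source.

```python
import copy
from typing import Any, Callable, Optional

Bonus = Callable[[dict], float]


def branch_validate(
    policy: Any,
    bonus: Bonus,
    train: Callable[[Any, Optional[Bonus], int], Any],
    evaluate_sparse: Callable[[Any], float],
    horizon: int = 100,
    margin: float = 0.0,
) -> bool:
    """Return True if a short shaped branch beats an unshaped control branch."""
    # Both branches start from deep copies, so the main policy stays unchanged.
    shaped = train(copy.deepcopy(policy), bonus, horizon)
    control = train(copy.deepcopy(policy), None, horizon)
    # Candidates are judged on the original sparse metric, not the shaped reward.
    return evaluate_sparse(shaped) - evaluate_sparse(control) > margin


# Toy stand-ins (assumptions): the "policy" is a dict with a single skill value.
def toy_train(policy: dict, bonus: Optional[Bonus], steps: int) -> dict:
    # Skill grows at a base rate, adjusted by the bonus for one unit of progress.
    rate = 0.01 + (bonus({"progress": 1.0}) if bonus else 0.0)
    policy["skill"] += rate * steps
    return policy


def toy_eval(policy: dict) -> float:
    return min(1.0, policy["skill"])


main_policy = {"skill": 0.0}
helpful = branch_validate(
    main_policy, lambda t: 0.005 * t["progress"], toy_train, toy_eval, horizon=10
)
harmful = branch_validate(
    main_policy, lambda t: -0.02 * t["progress"], toy_train, toy_eval, horizon=10
)
```

Here the helpful rule passes, the harmful one is rejected, and `main_policy` is left exactly as it was before either check.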
Why It Matters
Automates reward design for multi-agent systems, cutting manual effort and enabling LLMs to improve coordination in sparse-reward scenarios.