DDPL uses diffusion models to fix multi-agent RL exploration limits
New algorithm overcomes Gaussian policy limitations and improves on baselines across 4 benchmarks
Cooperative multi-agent reinforcement learning (MARL) struggles with exploration, especially as the number of agents grows. Standard decentralized softmax policy gradient (DecSPG) algorithms rely on Gaussian policies, whose limited expressiveness severely restricts exploration in high-dimensional action spaces. This bottleneck worsens with more agents, leading to poor coordination and suboptimal policies.
To overcome this, the authors introduce Decentralized Diffusion Policy Learning (DDPL). Each agent's policy is parameterized by a denoising diffusion probabilistic model, an expressive generative model that can capture multi-modal action distributions. DDPL enables efficient online training via a novel importance sampling score matching (ISSM) method that comes with theoretical guarantees. In tests on four benchmarks (multi-agent particle environments, multi-agent MuJoCo, IsaacLab, and JAX-reimplemented StarCraft), DDPL consistently improved performance over baselines, suggesting that diffusion policies can unlock new exploration capabilities for multi-agent systems.
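To make the "multi-modal" point concrete, here is a minimal sketch of how a single agent could draw an action from a diffusion policy using standard DDPM ancestral sampling. It is not the authors' implementation and does not include ISSM training. Everything named here is an assumption for illustration: the 1-D action, the two-mode target at -1 and +1, the noise schedule, and `predict_noise`, which stands in for a learned noise network by using the closed-form answer available for a Gaussian-mixture target.

```python
import math
import random

# Hypothetical setup: a 1-D action with two equally good choices at -1 and +1.
MODES = (-1.0, 1.0)
MODE_STD = 0.1
T = 50  # number of diffusion steps (illustrative value)

# Linear noise schedule beta_t and the cumulative products alpha_bar_t.
BETAS = [1e-4 + (0.2 - 1e-4) * t / (T - 1) for t in range(T)]
ALPHAS = [1.0 - b for b in BETAS]
ALPHA_BARS = []
_prod = 1.0
for _a in ALPHAS:
    _prod *= _a
    ALPHA_BARS.append(_prod)


def predict_noise(x, t):
    """Stand-in for a learned noise network eps_theta(x_t, t).

    For a Gaussian-mixture target the noised marginal is again a mixture,
    so the optimal prediction is eps = -sqrt(1 - alpha_bar_t) * score(x_t).
    """
    abar = ALPHA_BARS[t]
    var = abar * MODE_STD ** 2 + (1.0 - abar)
    means = [math.sqrt(abar) * m for m in MODES]
    # Posterior weight of each mode, computed stably from log-densities.
    logs = [-((x - m) ** 2) / (2.0 * var) for m in means]
    top = max(logs)
    weights = [math.exp(v - top) for v in logs]
    total = sum(weights)
    score = sum(w / total * (m - x) / var for w, m in zip(weights, means))
    return -math.sqrt(1.0 - abar) * score


def sample_action(rng):
    """Run the DDPM reverse process from pure noise down to an action."""
    x = rng.gauss(0.0, 1.0)
    for t in reversed(range(T)):
        eps = predict_noise(x, t)
        coef = BETAS[t] / math.sqrt(1.0 - ALPHA_BARS[t])
        mean = (x - coef * eps) / math.sqrt(ALPHAS[t])
        noise = rng.gauss(0.0, 1.0) if t > 0 else 0.0  # no noise on last step
        x = mean + math.sqrt(BETAS[t]) * noise
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    actions = [sample_action(rng) for _ in range(400)]
    left = sum(1 for a in actions if a < 0)
    print("actions near -1:", left, "| actions near +1:", len(actions) - left)
```

In this toy setting the sampled actions cluster around both modes. A single Gaussian fitted to the same target would center near 0, an action that neither mode favors, which is the kind of expressiveness gap the article describes.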
- Standard Gaussian policies in decentralized MARL limit exploration, worsening with more agents
- DDPL replaces Gaussian policies with denoising diffusion probabilistic models for multi-modal actions
- New importance sampling score matching (ISSM) method enables efficient online training; DDPL was tested on 4 benchmarks
Why It Matters
DDPL enables more effective exploration in multi-agent systems, a capability critical for applications such as robotics, drone swarms, and autonomous driving.