DDPL uses diffusion models to fix multi-agent RL exploration limits

New algorithm moves past the limits of Gaussian policies and improves on baselines across 4 benchmarks

Deep Dive

Cooperative multi-agent reinforcement learning (MARL) struggles with exploration, and the problem grows with the number of agents. Standard decentralized softmax policy gradient (DecSPG) algorithms rely on Gaussian policies, which are unimodal and therefore too inexpressive to explore high-dimensional action spaces well. The result is poor coordination and suboptimal policies.

To overcome this, the authors introduce Decentralized Diffusion Policy Learning (DDPL). Each agent's policy is parameterized by a denoising diffusion probabilistic model, an expressive generative model that can capture multi-modal action distributions. DDPL trains these policies online and efficiently through a novel importance sampling score matching (ISSM) method that comes with theoretical guarantees. In tests on multi-agent particle environments, multi-agent MuJoCo, IsaacLab, and JAX-reimplemented StarCraft, DDPL consistently improved performance over baselines, suggesting that diffusion policies can unlock new exploration capabilities for multi-agent systems.
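To make the multi-modality point concrete, here is a minimal Python sketch of how a denoising diffusion model turns Gaussian noise into an action. It is not DDPL's implementation. The two-mode 1-D action target, the noise schedule, and every name (`predict_noise`, `sample_action`, `MODES`) are assumptions chosen for illustration, and the learned denoising network is replaced by a closed-form noise predictor that exists only because the toy target is a Gaussian mixture.

```python
import math
import random

# Illustrative linear noise schedule (assumed values, not from the source).
T = 50
BETAS = [1e-4 + (0.2 - 1e-4) * i / (T - 1) for i in range(T)]
ALPHAS = [1.0 - b for b in BETAS]
ALPHA_BARS = []
_running = 1.0
for _alpha in ALPHAS:
    _running *= _alpha
    ALPHA_BARS.append(_running)

# Hypothetical bimodal target: two equally good actions at -1 and +1.
MODES = (-1.0, 1.0)
MODE_STD = 0.1


def predict_noise(a_t, t):
    """Closed-form stand-in for a trained denoiser on the toy target.

    The noised marginal of a Gaussian mixture is again a Gaussian mixture,
    so its score is analytic; the noise prediction is -sqrt(1 - alpha_bar) * score.
    """
    ab = ALPHA_BARS[t]
    var = ab * MODE_STD ** 2 + (1.0 - ab)
    means = [math.sqrt(ab) * m for m in MODES]
    logits = [-(a_t - m) ** 2 / (2.0 * var) for m in means]
    top = max(logits)  # subtract the max for numerical stability
    resp = [math.exp(l - top) for l in logits]
    total = sum(resp)
    score = sum(r / total * (m - a_t) / var for r, m in zip(resp, means))
    return -math.sqrt(1.0 - ab) * score


def sample_action(rng):
    """DDPM ancestral sampling: start from N(0, 1) and denoise step by step."""
    a = rng.gauss(0.0, 1.0)
    for t in reversed(range(T)):
        eps_hat = predict_noise(a, t)
        mean = (a - BETAS[t] / math.sqrt(1.0 - ALPHA_BARS[t]) * eps_hat) / math.sqrt(ALPHAS[t])
        noise = rng.gauss(0.0, 1.0) if t > 0 else 0.0  # no noise on the final step
        a = mean + math.sqrt(BETAS[t]) * noise
    return a


rng = random.Random(0)
actions = [sample_action(rng) for _ in range(400)]
near_left = sum(abs(a + 1.0) < 0.5 for a in actions) / len(actions)
near_right = sum(abs(a - 1.0) < 0.5 for a in actions) / len(actions)
near_zero = sum(abs(a) < 0.3 for a in actions) / len(actions)
print(f"near -1: {near_left:.2f}, near +1: {near_right:.2f}, near 0: {near_zero:.2f}")
```

The sampled actions split between the two modes. A single Gaussian fitted to the same target would be centred near 0, between the modes, which is the kind of expressiveness limit the article attributes to Gaussian policies.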

Key Points
  • Standard Gaussian policies in decentralized MARL limit exploration, and the problem worsens as agents are added
  • DDPL replaces Gaussian policies with denoising diffusion probabilistic models that can represent multi-modal action distributions
  • New importance sampling score matching (ISSM) method enables efficient online training with theoretical guarantees; tested on 4 benchmarks
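
The article does not give the ISSM objective, so the Python sketch below is not the paper's method. It only illustrates the generic idea the name points to: a denoising score matching loss whose per-sample terms are reweighted with self-normalized importance weights. The Q-value estimates, the temperature, the placeholder noise model, and all function names are hypothetical.

```python
import math
import random


def self_normalized_weights(log_weights):
    """Turn unnormalized log-weights into weights that sum to 1."""
    top = max(log_weights)  # subtract the max for numerical stability
    unnorm = [math.exp(lw - top) for lw in log_weights]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def weighted_dsm_loss(noise_model, actions, log_weights, alpha_bar, rng):
    """Importance-weighted denoising score matching loss at one noise level.

    Each clean action is noised with the forward diffusion, and the squared
    noise-prediction error is scaled by that sample's importance weight.
    """
    weights = self_normalized_weights(log_weights)
    loss = 0.0
    for a0, w in zip(actions, weights):
        eps = rng.gauss(0.0, 1.0)
        a_t = math.sqrt(alpha_bar) * a0 + math.sqrt(1.0 - alpha_bar) * eps
        loss += w * (noise_model(a_t) - eps) ** 2
    return loss


# Hypothetical data: actions drawn from some proposal, with assumed value estimates.
rng = random.Random(0)
actions = [-1.0, -0.2, 0.3, 1.0]
q_values = [0.1, 0.5, 0.4, 2.0]  # assumed Q estimates, not from the source
temperature = 0.5                # assumed temperature
log_weights = [q / temperature for q in q_values]

weights = self_normalized_weights(log_weights)
loss = weighted_dsm_loss(lambda a_t: 0.0, actions, log_weights, alpha_bar=0.5, rng=rng)
print("weights:", [round(w, 3) for w in weights])
print("loss with a placeholder noise model:", round(loss, 3))
```

In this sketch, actions with higher assumed value get larger weights and therefore pull the denoiser harder toward themselves. How DDPL actually defines its weights and proposal distribution is left to the paper.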

Why It Matters

More effective exploration in multi-agent systems matters for applications such as robotics, drone swarms, and autonomous driving, and DDPL's results suggest that diffusion policies are one way to achieve it.