Agent Frameworks

Distributed RL from human feedback scales multi-agent learning without central control

New algorithm lets agents learn from human preferences using only local state information...

Deep Dive

A new paper from Dai, Wang, Wang, Qin, and Yu tackles the challenge of scaling reinforcement learning from human feedback (RLHF) to multi-agent systems. Existing RLHF methods typically assume centralized training and a single agent, which limits their use in large, networked environments like robot swarms or traffic networks. The authors propose a distributed zeroth-order policy gradient algorithm in which each agent uses only local state-action information from its κ-hop neighborhood (the agents within κ hops of it in the network graph) to estimate gradients from human preference feedback on pairs of H-horizon trajectories. This eliminates the need for explicit reward signals or a central coordinator.
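To make the κ-hop neighborhood concrete, here is a minimal sketch that collects every agent within κ hops of a given agent using breadth-first search. The function name k_hop_neighborhood, the adjacency-list representation, and the five-agent line graph are illustrative assumptions, not details taken from the paper.

```python
from collections import deque


def k_hop_neighborhood(adjacency, agent, kappa):
    """Return the set of agents within `kappa` hops of `agent`, itself included.

    `adjacency` is a hypothetical adjacency list: {agent_id: [neighbor_ids]}.
    """
    seen = {agent}
    frontier = deque([(agent, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == kappa:
            # Do not expand past the kappa-hop boundary.
            continue
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, dist + 1))
    return seen


# Illustrative network: five agents on a line, 0 - 1 - 2 - 3 - 4.
line_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
```

With this graph, agent 2 with κ = 1 sees agents 1, 2, and 3, so the amount of information each agent handles depends on κ and the local graph structure rather than on the total number of agents.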

The algorithm is fully distributed: agents sample perturbed joint policies from a Gaussian distribution, collect preference feedback on their local trajectories, and update their own policies. The theoretical analysis proves convergence to an ε-stationary point with polynomial sample complexity. Experiments in stochastic GridWorld and predator-prey environments show the method matches or outperforms centralized baselines while scaling to hundreds of agents. The result is a step toward practical multi-agent RLHF in real-world decentralized systems.
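The perturb, compare, update loop described above can be illustrated with a toy sketch. This is not the authors' algorithm: each agent here has a single scalar policy parameter, a "trajectory" is reduced to that parameter's value, the human is simulated with a Bradley-Terry style preference model over a hidden score, and every name and constant (hidden_local_score, preference_oracle, mu, lr, queries) is an assumption made for illustration. The learner only ever sees the 0/1 preference, never the hidden score.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def hidden_local_score(theta_i, target_i):
    # Stand-in for the human's latent utility of agent i's local behavior.
    # Assumption for this toy: it is only used inside the simulated human.
    return -(theta_i - target_i) ** 2


def preference_oracle(score_a, score_b, rng):
    """Simulated human: return 1 if option A is preferred over B, else 0."""
    return 1 if rng.random() < sigmoid(score_a - score_b) else 0


def zeroth_order_step(theta, targets, rng, mu=0.5, lr=0.1, queries=20):
    """One update per agent, using only preference bits on local comparisons."""
    new_theta = []
    for theta_i, target_i in zip(theta, targets):
        grad_estimate = 0.0
        for _ in range(queries):
            u = rng.gauss(0.0, 1.0)  # Gaussian perturbation direction
            preferred = preference_oracle(
                hidden_local_score(theta_i + mu * u, target_i),
                hidden_local_score(theta_i, target_i),
                rng,
            )
            # Move toward perturbations the human prefers, away from the rest.
            grad_estimate += (preferred - 0.5) * u / mu
        new_theta.append(theta_i + lr * grad_estimate / queries)
    return new_theta


def run_demo(steps=300, seed=0):
    rng = random.Random(seed)
    targets = [3.0, -2.0, 1.0]  # hypothetical per-agent optima
    theta = [0.0, 0.0, 0.0]
    for _ in range(steps):
        theta = zeroth_order_step(theta, targets, rng)
    return theta, targets
```

Calling run_demo() returns the learned parameters alongside the hidden targets; in this toy setting the parameters drift toward the targets even though no reward value is ever revealed. Because each agent's update depends only on its own comparison outcomes, no central coordinator appears anywhere in the loop.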

Key Points
  • Fully distributed algorithm uses only local κ-hop neighborhood state-action information
  • Converges to ε-stationary point with polynomial sample complexity (no explicit reward needed)
  • Validated on GridWorld and predator-prey environments – scales to large agent networks

Why It Matters

Enables scalable multi-agent learning from human preferences without central servers, critical for decentralized robotics and IoT.