CoFi-PGMA: Counterfactual Policy Gradients under Filtered Feedback for Multi-Agent LLMs
Fixes credit assignment in multi-agent LLM systems, enabling better collaboration and routing.
Researchers Stela Tong and Elai Ben-Gal have introduced CoFi-PGMA (Counterfactual Policy Gradients under Filtered Feedback for Multi-Agent LLMs), a unified framework for training large language models in multi-agent architectures. As LLMs are increasingly deployed in systems where multiple models either compete through routing mechanisms or collaborate to produce a single answer, the learning signal each agent receives becomes filtered. In routing, only the chosen response is evaluated (selection-gated feedback), while collaboration yields shared rewards that mask individual contributions. Standard RLHF objectives, tailored for single-policy deployments, break down in these settings, leaving each agent with a misspecified learning signal.
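To make the selection-gated failure mode concrete, here is a minimal NumPy toy (our own illustration, not the paper's experiment; the difficulty model, agent qualities, and router logits are all assumptions). When per-query difficulty influences both which agent the router selects and the realized reward, averaging rewards only over the queries where an agent was chosen gives a biased picture of its value:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Toy setup: per-query difficulty shifts both the router's choice and the
# realized reward, so conditioning on "was selected" biases a naive
# per-agent reward estimate.
difficulty = rng.uniform(0.0, 1.0, size=n)
true_means = np.array([0.8, 0.6])          # latent per-agent quality

# Softmax router that prefers agent 0 on harder queries.
logits = np.stack([2.0 * difficulty, np.ones(n)], axis=1)
p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
chosen = (rng.uniform(size=n) > p[:, 0]).astype(int)

# Selection-gated feedback: only the chosen response is scored.
reward = true_means[chosen] - 0.5 * difficulty + rng.normal(0.0, 0.05, n)

for a in range(2):
    sel = chosen == a
    naive = reward[sel].mean()                        # conditions on selection
    target = true_means[a] - 0.5 * difficulty.mean()  # all-queries value
    print(f"agent {a}: naive={naive:.3f} vs unconditioned target={target:.3f}")
```

Running this, agent 0 (routed to hard queries) looks worse than it is and agent 1 looks better, even with ample data: the bias comes from the filtering, not from noise.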
CoFi-PGMA addresses this by deriving a counterfactual per-agent training objective based on marginal contribution. For routing systems, this translates to off-policy corrections for selection-gated feedback, while for collaborative systems, it reduces to leave-one-out difference rewards for credit assignment. The framework further analyzes how softmax routing induces risk-sensitive incentives and provides practical training algorithms that integrate counterfactual estimators, multi-turn-aware rewards, and policy optimization methods. Demonstrated on a real-world reasoning dataset, CoFi-PGMA enables more effective training of multi-agent LLM systems, improving how efficiently they collaborate or compete.
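The sketch below shows the generic form of the two estimators named above: an inverse-propensity weight for selection-gated routing feedback, and a leave-one-out difference reward for shared collaborative rewards. The helper names (`routing_pg_weight`, `toy_team_reward`) are ours, and the paper's exact formulations may differ in detail:

```python
import numpy as np

def routing_pg_weight(reward, select_prob, baseline=0.0):
    """Off-policy correction for selection-gated routing feedback.

    The selected agent's policy-gradient term is scaled by
    (reward - baseline) / P(selected), so updates computed only on
    selected samples match the all-queries objective in expectation."""
    return (reward - baseline) / select_prob

def difference_rewards(team_reward_fn, agents):
    """Leave-one-out difference rewards for shared collaborative feedback.

    Each agent is credited with its marginal contribution: the team score
    with everyone minus the team score with that agent removed."""
    full = team_reward_fn(agents)
    return {a: full - team_reward_fn([b for b in agents if b != a])
            for a in agents}

# Toy usage with a diminishing-returns team score (purely illustrative).
def toy_team_reward(active):
    skill = {"planner": 0.5, "solver": 0.9, "critic": 0.2}
    return 1.0 - np.prod([1.0 - skill[a] for a in active])

print(difference_rewards(toy_team_reward, ["planner", "solver", "critic"]))
print(routing_pg_weight(reward=0.7, select_prob=0.1))  # rare selection -> large weight
```

In the toy team, the "solver" earns the largest difference reward because removing it hurts the shared score most, which is exactly the per-agent credit that a single shared reward would have masked.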
- CoFi-PGMA corrects learning signals for multi-agent LLMs under routing (selection-gated) and collaborative (shared rewards) feedback.
- Derives counterfactual per-agent objectives based on marginal contribution, using off-policy corrections for routing and leave-one-out difference rewards for collaboration.
- Includes practical algorithms with counterfactual estimators, multi-turn-aware rewards, and policy optimization, validated on a real-world reasoning dataset.
Why It Matters
Enables effective training for multi-agent LLM systems, critical for scaling AI collaboration and routing in production.