TeamTR lifts multi-agent LLM teams 7.1% above single-agent baselines with trust-region fine-tuning
New framework fixes sequential fine-tuning failures that made multi-agent teams underperform single models.
A new paper from Yi Xie and colleagues, accepted at ICML 2026, tackles a fundamental flaw in multi-agent LLM systems: teams of fine-tuned models often underperform a single model. The authors trace this to what they call 'compounding occupancy shift'. When agents are fine-tuned sequentially, each update to one agent shifts the distribution of contexts the whole team encounters. Evaluating later updates on cached rollouts, collected before those shifts, incurs a penalty that scales quadratically with the number of agents. Evaluating each update on fresh rollouts from the current team (the intermediate occupancy) reduces that scaling to linear.
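The gap between quadratic and linear scaling can be illustrated with a simple counting argument. The sketch below is not the paper's bound; it assumes, purely for illustration, that each agent update contributes one unit of distribution shift, and that an update evaluated on cached rollouts also pays one unit for every earlier update it has not seen. The function names are hypothetical.

```python
def cached_shift_units(num_agents):
    """Toy count: agent i pays for its own step plus the i-1 stale updates before it."""
    return sum(i for i in range(1, num_agents + 1))  # n(n+1)/2, quadratic in n


def fresh_shift_units(num_agents):
    """Toy count: with fresh rollouts, each agent pays only for its own step."""
    return num_agents  # linear in n


for n in (2, 4, 8):
    print(n, cached_shift_units(n), fresh_shift_units(n))
```

Under these assumptions, doubling the team size roughly quadruples the cached-rollout count while only doubling the fresh-rollout count.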
To address this, TeamTR takes a trust-region approach: after each agent update, it resamples trajectories from the new team configuration and enforces a per-agent divergence constraint, such as a bound on KL divergence. According to the authors, this guarantees monotonic improvement both per update and per stage. In their experiments, TeamTR outperforms single-agent baselines by 7.1% on average, eliminates coordination regressions, and allows plug-and-play replacement of components without retraining from scratch. The code is open-sourced.
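The update pattern described above (resample after every agent update, then take a divergence-bounded step) can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the authors' implementation: agents are plain categorical distributions rather than LLMs, the team reward, the `propose` rule, and the step-halving search are all hypothetical stand-ins, and none of the function names come from the TeamTR code.

```python
import math
import random


def kl(p, q):
    """KL(p || q) for two categorical distributions; q must have full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def trust_region_step(old, target, delta):
    """Move from old toward target, halving the step until KL(new || old) <= delta."""
    step = 1.0
    while step > 1e-6:
        new = [(1 - step) * o + step * t for o, t in zip(old, target)]
        if kl(new, old) <= delta:
            return new
        step /= 2
    return list(old)  # no acceptable step found: keep the old policy


def sample_rollouts(policies, rng, n=500):
    """Draw fresh joint actions from the current team (toy stand-in for trajectories)."""
    rollouts = []
    for _ in range(n):
        actions = [rng.choices(range(len(p)), weights=p)[0] for p in policies]
        reward = 1.0 if all(a == 0 for a in actions) else 0.0  # hypothetical team reward
        rollouts.append((actions, reward))
    return rollouts


def propose(i, policy, rollouts):
    """Hypothetical proposal: reweight agent i's actions by exp(mean observed reward)."""
    values = []
    for a in range(len(policy)):
        rewards = [r for acts, r in rollouts if acts[i] == a]
        values.append(sum(rewards) / len(rewards) if rewards else 0.0)
    weights = [p * math.exp(v) for p, v in zip(policy, values)]
    total = sum(weights)
    return [w / total for w in weights]


def sequential_update(policies, delta, rng):
    """One stage: update agents in order, resampling after every agent update."""
    policies = [list(p) for p in policies]
    for i in range(len(policies)):
        rollouts = sample_rollouts(policies, rng)  # fresh data from the updated team
        target = propose(i, policies[i], rollouts)
        policies[i] = trust_region_step(policies[i], target, delta)
    return policies


team = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
updated = sequential_update(team, delta=0.01, rng=random.Random(0))
```

The key design point is that `sample_rollouts` is called inside the loop, so agent `i` is always updated on data generated by the team as it stands after agents `0..i-1` have changed. Moving that call outside the loop would reproduce the cached-rollout setting the paper argues against.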
- Compounding occupancy shift imposes a penalty that grows quadratically with the number of agents when sequential fine-tuning relies on cached rollouts; TeamTR's fresh rollouts reduce it to linear.
- TeamTR enforces per-agent trust-region constraints, which the authors show guarantee monotonic improvement; in experiments it beats single-agent baselines by 7.1% on average.
- Supports plug-and-play replacement of individual agents without full retraining, enabling modular team upgrades.
Why It Matters
TeamTR offers a way to build multi-agent LLM teams that reliably outperform single models, which is critical for complex tasks like code generation or robotics.