SAT: Sequential Agent Tuning for Coordinator-Free, Plug-and-Play Multi-LLM Training with Monotonic Improvement Guarantees
Three 4B-parameter agents trained with SAT beat Qwen3-32B by 3.9% on math benchmarks.
A new paper from Yi Xie and colleagues (published at AAMAS 2026) introduces SAT (Sequential Agent Tuning), a training method that lets you deploy teams of smaller LLMs as a coordinated ensemble — without needing a central controller. The key innovation: SAT treats the multi-agent team as a factorized policy and updates agents sequentially via block-coordinate optimization. Each agent is trained with a sequence-aware advantage estimator and a per-agent KL trust region, which prevents the distribution shifts that normally plague multi-agent training. This setup provides two theoretical guarantees: monotonic improvement during training, and plug-and-play invariance — meaning any agent can be swapped for a stronger model without retraining the rest of the team, with a formal bound on performance improvement.
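To make the mechanics concrete, here is a minimal, runnable sketch of the sequential update scheme on a toy problem. Everything in it is illustrative: `ToyAgent`, `team_rollout`, `sat_round`, the dummy reward, and the mean-baseline advantage are invented stand-ins for the paper's models, sequence-aware estimator, and hyperparameters, not its actual code.

```python
# Illustrative sketch only: ToyAgent, team_rollout, and sat_round are invented
# stand-ins for the paper's setup, not its released code.
import copy
import torch
import torch.nn.functional as F

VOCAB, BATCH = 16, 64

class ToyAgent(torch.nn.Module):
    """Stand-in for an LLM agent: maps the running context to action logits."""
    def __init__(self, hidden=32):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(VOCAB, hidden), torch.nn.Tanh(),
            torch.nn.Linear(hidden, VOCAB))

    def forward(self, ctx):
        return self.net(ctx)

def team_rollout(agents):
    """Factorized team policy: agents act in a fixed order, and each one
    conditions on the actions of the agents before it."""
    ctx = torch.zeros(BATCH, VOCAB)
    for agent in agents:
        with torch.no_grad():
            dist = torch.distributions.Categorical(logits=agent(ctx))
        a = dist.sample()
        ctx = ctx + F.one_hot(a, VOCAB).float()
    return (a == 3).float().mean()  # dummy task: final action should be 3

def sat_round(agents, lr=1e-2, kl_coeff=0.5, steps=50):
    """One SAT-style round: update each agent in turn (block-coordinate
    ascent) while every other agent stays frozen."""
    for i in range(len(agents)):
        ref = copy.deepcopy(agents[i]).eval()  # frozen snapshot anchoring the KL trust region
        opt = torch.optim.Adam(agents[i].parameters(), lr=lr)
        for _ in range(steps):
            ctx = torch.zeros(BATCH, VOCAB)
            for j, agent in enumerate(agents):
                with torch.set_grad_enabled(j == i):  # only agent i gets gradients
                    logits = agent(ctx)
                dist = torch.distributions.Categorical(logits=logits)
                a = dist.sample()
                if j == i:
                    logp = dist.log_prob(a)
                    with torch.no_grad():
                        ref_dist = torch.distributions.Categorical(logits=ref(ctx))
                    kl = torch.distributions.kl_divergence(dist, ref_dist).mean()
                ctx = ctx + F.one_hot(a, VOCAB).float()
            reward = (a == 3).float()
            adv = reward - reward.mean()  # crude baseline, standing in for the
                                          # paper's sequence-aware advantage estimator
            loss = -(adv * logp).mean() + kl_coeff * kl  # REINFORCE + KL penalty
            opt.zero_grad()
            loss.backward()
            opt.step()

agents = [ToyAgent() for _ in range(3)]
print("before:", team_rollout(agents).item())
sat_round(agents)
print("after: ", team_rollout(agents).item())
```

The KL term here is a soft penalty rather than a hard constraint, but the structure matches the description above: one agent updates at a time against a frozen snapshot of itself while the rest of the team is held fixed, which is the block-coordinate pattern the monotonic-improvement guarantee rests on.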
Empirically, SAT delivers striking results. A team of three 4B-parameter agents (12B total parameters) trained with SAT surpasses the much larger Qwen3-32B model on the AIME24/25 math benchmarks by an average of 3.9%. Even more telling: when the researchers replaced two of the 4B agents with 8B agents (without retraining the remaining team), the composite score jumped by 10.4%, validating the plug-and-play guarantee. This suggests SAT could substantially lower AI deployment costs: instead of running one monolithic model, organizations can use a modular team of smaller, cheaper models that collectively match or exceed cutting-edge performance. The code and proofs are available on arXiv.
- SAT trains multi-LLM teams sequentially without a coordinator, using per-agent KL trust regions to avoid distribution shifts.
- Three 4B agents (12B total) beat the single Qwen3-32B model by 3.9% on AIME24/25 math benchmarks.
- Plug-and-play invariance lets users swap any agent for a stronger one without retraining, boosting the composite score by 10.4% when upgrading two agents to 8B (see the sketch after this list).
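In the toy setting above, plug-and-play is just replacing one entry in the agent list and re-running evaluation, with no further training. The snippet below continues from the earlier sketch (it reuses the invented `ToyAgent`, `team_rollout`, and `agents`); the higher-capacity replacement stands in for swapping a 4B agent for an 8B one.

```python
# Continues the toy sketch above. A wider ToyAgent stands in for a stronger
# drop-in replacement (e.g., an 8B model swapped in for a 4B one).
stronger = ToyAgent(hidden=128)   # more capacity; train or fine-tune it separately
agents[1] = stronger              # hot-swap: agents 0 and 2 are untouched
print("after swap:", team_rollout(agents).item())
```

The paper's invariance result is what makes this safe in the real system: the remaining agents need no retraining, and the resulting change in composite performance is formally bounded. The snippet only illustrates the deployment mechanics.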
Why It Matters
SAT makes teams of smaller, cheaper LLMs a viable alternative to a single monolithic giant model, cutting deployment costs.