
In2AI's training method beats GPT-5 with 8B parameter model

arXiv cs.AI June 02, 2026

⚡Delayed per-step reward attribution enables 8B model to outperform GPT-5 at NeurIPS.

Deep Dive

Training language model agents for multi-agent strategic interaction is notoriously difficult because the quality of an action often cannot be judged when it is taken: it may depend on future events, on illegal moves, or on other agents' decisions. Standard reinforcement learning setups assume a reward is available at every step, and that assumption breaks down when outcomes are entangled across time and agents. Researchers from In2AI (Aliaksei Korshuk, Alexander Buyantuev, Ilya Makarov) address this with delayed per-step reward attribution with eligibility gating. Rewards are computed only once the episode has ended, propagated back to the steps that produced them using task-specific semantics, and invalid steps are excluded from training.
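The idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Step` record, the `attribute_rewards` function, and the `final_score_credit` rule are hypothetical names chosen for illustration, and the credit rule (each valid step receives its agent's final score) is only one assumed example of "task-specific semantics".

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Step:
    """One recorded action in an episode (hypothetical structure)."""
    agent_id: int
    action: str
    valid: bool                      # False for illegal or malformed moves
    reward: Optional[float] = None   # stays None until the episode has ended
    eligible: bool = False           # eligibility gate for the training set


Outcome = Dict[int, float]           # final score per agent, known only at episode end


def final_score_credit(step: Step, index: int, outcome: Outcome) -> float:
    # Assumed credit rule: every valid step inherits its agent's final score.
    return outcome[step.agent_id]


def attribute_rewards(
    steps: List[Step],
    outcome: Outcome,
    credit_fn: Callable[[Step, int, Outcome], float] = final_score_credit,
) -> List[Step]:
    """Assign rewards after the episode and return only training-eligible steps."""
    for index, step in enumerate(steps):
        if not step.valid:
            # Gate out invalid steps: they get no reward and are not trained on.
            step.reward = None
            step.eligible = False
            continue
        step.reward = credit_fn(step, index, outcome)
        step.eligible = True
    return [s for s in steps if s.eligible]


episode = [
    Step(agent_id=0, action="cooperate", valid=True),
    Step(agent_id=1, action="defect", valid=True),
    Step(agent_id=0, action="<malformed>", valid=False),
    Step(agent_id=1, action="cooperate", valid=True),
]
train_steps = attribute_rewards(episode, outcome={0: 1.0, 1: -1.0})
print([s.reward for s in train_steps])  # [1.0, -1.0, -1.0]
```

Because rewards are written only after the outcome is known, a different game can swap in its own `credit_fn` without changing the gating logic.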

Combined with asynchronous rollout generation via vLLM's continuous batching, curriculum-based opponent sampling, and multi-level stratified batch construction, the technique delivers stable, sample-efficient RL training in multi-agent environments. The team evaluated it on MindGames Arena at NeurIPS 2025, where a single 8-billion-parameter open-source model matched or surpassed substantially larger proprietary systems, including GPT-5, in head-to-head play. It took first place in both the Open (unrestricted) and Efficient (≤8B parameters) tracks, suggesting that better training algorithms can offset a disadvantage in model size.
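As a rough illustration of what stratified batch construction can look like, the Python sketch below groups training samples by several keys at once and fills a batch round-robin across the groups. The article does not describe the actual procedure, so the function name `stratified_batch`, the keys `env` and `role`, and the round-robin policy are all assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def stratified_batch(
    samples: List[Dict],
    batch_size: int,
    keys: Sequence[str] = ("env", "role"),   # assumed stratification levels
    seed: int = 0,
) -> List[Dict]:
    """Build a batch that draws as evenly as possible from each stratum."""
    rng = random.Random(seed)

    # Group samples by the combination of all stratification keys.
    strata = defaultdict(list)
    for sample in samples:
        strata[tuple(sample[k] for k in keys)].append(sample)

    # Shuffle within each stratum so the draw inside a group is random.
    pools = []
    for key in sorted(strata):
        pool = list(strata[key])
        rng.shuffle(pool)
        pools.append(pool)

    # Round-robin across strata so no single group dominates the batch.
    batch: List[Dict] = []
    while len(batch) < batch_size and any(pools):
        for pool in pools:
            if pool and len(batch) < batch_size:
                batch.append(pool.pop())
    return batch


samples = (
    [{"env": "game_a", "role": "p0", "id": i} for i in range(10)]
    + [{"env": "game_b", "role": "p1", "id": i} for i in range(2)]
)
batch = stratified_batch(samples, batch_size=4)
print(sum(s["env"] == "game_b" for s in batch))  # 2
```

With plain uniform sampling the rare `game_b` samples would usually be under-represented in a batch of four; the round-robin draw gives each stratum an equal share until it runs out.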

Key Points

Method uses delayed per-step reward attribution with eligibility gating to handle entangled multi-agent outcomes.
An 8B open-source model matched or surpassed GPT-5 and other larger proprietary systems in head-to-head play.
Won first place in both the Open and Efficient tracks at NeurIPS 2025 MindGames Arena.

Why It Matters

Suggests that smaller open-source models can match or beat much larger systems such as GPT-5 when trained with better-targeted techniques.
