Research & Papers

In2AI's training method beats GPT-5 with an 8B-parameter model

Delayed per-step reward attribution lets an 8B model match or outperform GPT-5 in the MindGames Arena at NeurIPS 2025.

Deep Dive

Training language-model agents for multi-agent strategic interaction is notoriously difficult because the quality of an action often cannot be judged at the moment it is taken: it depends on future events, illegal moves, and other agents' decisions. Standard reinforcement learning setups typically assume that a reward is available at each step, an assumption that breaks down when outcomes are entangled across time and agents. Researchers from In2AI (Aliaksei Korshuk, Alexander Buyantuev, Ilya Makarov) address this with a method they call delayed per-step reward attribution with eligibility gating. Rewards are computed only at the end of an episode and propagated back to the steps that produced them using task-specific semantics, while invalid steps are excluded from training.
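
The attribution-and-gating idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the `Step` record, the `attribute_rewards` and `eligible_steps` helpers, and the rule "each step receives its agent's final score, discounted by how many of that agent's moves followed it" are all hypothetical stand-ins for the task-specific semantics the article mentions.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Step:
    """One agent action inside an episode (hypothetical record layout)."""
    agent_id: int
    action: str
    valid: bool                      # False for illegal or malformed moves
    reward: Optional[float] = None   # stays None until the episode has ended


def attribute_rewards(
    steps: List[Step],
    final_scores: Dict[int, float],
    discount: float = 1.0,
) -> List[Step]:
    """Assign each step a reward derived from its agent's end-of-episode score.

    Assumed rule: the last move by an agent gets that agent's full final
    score, and each earlier move by the same agent is discounted once more.
    """
    moves_seen = {agent: 0 for agent in final_scores}
    # Walk backward so the distance to the episode end is known per agent.
    for step in reversed(steps):
        k = moves_seen[step.agent_id]
        step.reward = final_scores[step.agent_id] * (discount ** k)
        moves_seen[step.agent_id] += 1
    return steps


def eligible_steps(steps: List[Step]) -> List[Step]:
    """Eligibility gate: keep only valid steps that received a reward."""
    return [s for s in steps if s.valid and s.reward is not None]


episode = [
    Step(agent_id=0, action="offer", valid=True),
    Step(agent_id=1, action="???", valid=False),   # illegal move, gated out
    Step(agent_id=0, action="accept", valid=True),
]
attribute_rewards(episode, final_scores={0: 1.0, 1: -1.0}, discount=0.5)
training_steps = eligible_steps(episode)
```

In this sketch the invalid step still receives a reward value for bookkeeping, but the gate keeps it out of the training set, which mirrors the "exclude invalid steps from training" behavior described above.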

Combined with asynchronous rollout generation via vLLM's continuous batching, curriculum-based opponent sampling, and multi-level stratified batch construction, the technique is reported to deliver stable, sample-efficient RL training in multi-agent environments. The team evaluated it in the MindGames Arena at NeurIPS 2025, where a single 8-billion-parameter open-source model matched or surpassed substantially larger proprietary systems, including GPT-5, in head-to-head play. It took first place in both the Open (unrestricted) and Efficient (≤8B parameters) tracks, suggesting that better training algorithms can offset a disadvantage in model size.
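
The article does not say how the curriculum chooses opponents. One common way to build such a curriculum, shown here purely as an assumption, is to sample opponents more often when the current win rate against them is close to 50%, so that training focuses on matches that are neither trivial nor hopeless. The function names, the `floor` parameter, and the weighting formula below are hypothetical.

```python
import random
from typing import Dict


def opponent_weights(win_rates: Dict[str, float], floor: float = 0.05) -> Dict[str, float]:
    """Turn per-opponent win rates into sampling probabilities.

    Assumed rule: weight peaks at a 0.5 win rate and falls off linearly,
    with a small floor so no opponent is dropped entirely.
    """
    raw = {
        name: max(floor, 1.0 - 2.0 * abs(rate - 0.5))
        for name, rate in win_rates.items()
    }
    total = sum(raw.values())
    return {name: weight / total for name, weight in raw.items()}


def sample_opponent(win_rates: Dict[str, float], rng: random.Random) -> str:
    """Draw one opponent name according to the curriculum weights."""
    weights = opponent_weights(win_rates)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


# Hypothetical win rates of the learner against three opponents.
win_rates = {"weak_bot": 0.95, "peer_bot": 0.50, "strong_bot": 0.10}
probs = opponent_weights(win_rates)
next_opponent = sample_opponent(win_rates, random.Random(0))
```

With these example numbers the evenly matched opponent gets the largest share of rollouts, while the opponent the learner almost always beats is sampled least.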

Key Points
  • Method uses delayed per-step reward attribution with eligibility gating to handle entangled multi-agent outcomes.
  • An 8B open-source model matched or surpassed substantially larger proprietary systems, including GPT-5, in head-to-head play.
  • Won first place in both the Open and Efficient tracks at NeurIPS 2025 MindGames Arena.

Why It Matters

The result suggests that smaller open-source models can compete with, and sometimes beat, much larger proprietary systems such as GPT-5 when trained with techniques suited to the task.