Research & Papers

Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks

A new co-evolution framework lets 8B LLMs outperform GPT-4 and Claude on long-horizon tasks.

Deep Dive

Researchers from the University of Maryland and collaborators have introduced COSPLEY, a co-evolution framework designed to enhance large language models (LLMs) on long-horizon interactive tasks. Such tasks, for example complex game playing, demand multi-step reasoning, chaining skills across many timesteps, and robust decision-making under delayed rewards and partial observability. Traditional LLMs often struggle in these environments because they lack a mechanism to discover, retain, and reuse structured skills across episodes, which leads to inconsistent performance. COSPLEY addresses this with two co-evolving agents: an LLM decision agent that retrieves skills from a learnable skill bank to guide action selection, and a skill bank agent that continuously extracts, refines, and updates reusable skills, each paired with a contract, from the decision agent's unlabeled rollouts.
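The paper itself does not publish the skill bank's data structures, but the description above can be sketched in miniature. In this hypothetical version, each skill carries a description and a contract, and the decision agent's retrieval step is stood in for by a toy word-overlap score (a real system would use an LLM or embedding similarity):

```python
from dataclasses import dataclass, field

@dataclass
class Skill:
    """A reusable skill plus a natural-language contract (hypothetical schema)."""
    name: str
    description: str
    contract: str  # when the skill applies and what it should achieve

@dataclass
class SkillBank:
    """Stores skills extracted from rollouts; the decision agent retrieves from it."""
    skills: list[Skill] = field(default_factory=list)

    def add_or_refine(self, skill: Skill) -> None:
        # Replace an existing skill with the same name, otherwise append.
        for i, existing in enumerate(self.skills):
            if existing.name == skill.name:
                self.skills[i] = skill
                return
        self.skills.append(skill)

    def retrieve(self, observation: str, k: int = 3) -> list[Skill]:
        # Toy relevance score: word overlap between observation and description.
        obs_words = set(observation.lower().split())
        scored = sorted(
            self.skills,
            key=lambda s: len(obs_words & set(s.description.lower().split())),
            reverse=True,
        )
        return scored[:k]
```

Retrieval then guides action selection: given the observation "a locked door blocks the path", a bank holding an `open_door` skill would rank it above unrelated skills, and the decision agent would condition its next action on that skill's contract.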

In experiments across six diverse game environments, COSPLEY delivered strong results. Using an 8B parameter base model, it achieved an average reward improvement of over 25.1% against four frontier LLM baselines, including GPT-4 and Claude, on single-player game benchmarks. The framework also remained competitive on multi-player social reasoning games, showing its versatility. The key innovation is the co-evolution loop: as the decision agent learns better skill retrieval and action generation, the skill bank agent improves the quality and relevance of the stored skills. This virtuous cycle boosts performance on long-horizon tasks without human-labeled data or extensive fine-tuning.
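The alternation described above can be illustrated with a deliberately minimal loop. All names here (`co_evolve`, `env_step`, `extract_skills`, `retrieve`) are hypothetical stand-ins: the real decision and skill bank agents are LLMs, whereas this sketch only shows the control flow of acting with retrieved skills and then updating the bank from the unlabeled rollout:

```python
def co_evolve(env_step, extract_skills, retrieve, bank, episodes=3, horizon=5):
    """Sketch of a co-evolution loop (hypothetical interfaces, not the paper's code).

    env_step(obs, action) -> (next_obs, reward): the interactive environment.
    extract_skills(rollout) -> list of skill names: the skill bank agent's job.
    retrieve(bank, obs) -> ranked skill names: the decision agent's retrieval step.
    bank: dict mapping skill name -> a crude usefulness count.
    """
    for _ in range(episodes):
        rollout, obs = [], "start"
        for _ in range(horizon):
            candidates = retrieve(bank, obs)             # decision agent consults the bank
            action = candidates[0] if candidates else "explore"
            obs, reward = env_step(obs, action)          # rewards may be sparse or delayed
            rollout.append((obs, action, reward))
        for skill in extract_skills(rollout):            # skill bank agent updates the bank
            bank[skill] = bank.get(skill, 0) + 1         # crude "refinement": reinforce reuse
    return bank
```

The point of the sketch is the cycle itself: better retrieval yields more informative rollouts, and those rollouts in turn yield a better-curated bank, with no labels required at any step.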

Key Points
  • COSPLEY uses a co-evolution framework with two agents: an LLM decision agent and a skill bank agent that extracts reusable skills from unlabeled rollouts.
  • With an 8B parameter base model, COSPLEY achieves over 25.1% average reward improvement against four frontier LLM baselines on single-player game benchmarks.
  • The framework remains competitive on multi-player social reasoning games, demonstrating its versatility across different interactive environments.

Why It Matters

COSPLEY shows that a smaller LLM can outperform larger frontier models on complex long-horizon tasks by dynamically discovering and reusing skills, suggesting a path to capability gains that relies on better skill curation rather than sheer model scale.