Research & Papers

AgentFloor: How Far Up the Tool-Use Ladder Can Small Open-Weight Models Go?

New 30-task benchmark reveals small open-weight models rival GPT-5 on short-horizon tool use.

Deep Dive

AgentFloor is a new deterministic benchmark designed to answer a practical question: which parts of an agent workflow truly need frontier intelligence? The benchmark spans 30 tasks across a six-tier capability ladder, from instruction following and structured tool use to multi-step coordination and long-horizon planning under persistent constraints. The authors tested 16 open-weight models ranging from 0.27B to 32B parameters alongside GPT-5, running 16,542 scored evaluations.

The results draw a clear boundary: small and mid-sized open-weight models already handle much of the short-horizon, structured tool use that dominates real agent pipelines. In aggregate, the strongest open-weight model matches GPT-5 on the benchmark while being substantially cheaper and faster to run. Frontier models still hold an advantage on long-horizon planning, but neither side reaches high reliability there. Importantly, the boundary isn't explained by scale alone: some failures respond to targeted interventions, but the effects are model-specific. The paper suggests a practical design principle: use smaller models for the broad base of routine actions, and reserve frontier models for the narrower class of tasks demanding deeper planning and control.

Key Points
  • AgentFloor consists of 30 tasks across 6 tiers, from instruction following to long-horizon planning.
  • Small open-weight models (0.27B–32B) achieve parity with GPT-5 on the short-horizon, structured tool use that dominates real agent pipelines.
  • Frontier models still outperform on long-horizon planning requiring persistent constraint tracking, but neither side reaches high reliability.

Why It Matters

Enables cost-effective agent design: route routine tasks to cheap small models, reserve expensive frontier models for complex planning.
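Here is a minimal sketch of that routing principle in Python. The tier labels loosely follow the capability ladder described above, but the specific tier names, model identifiers, and routing rule are illustrative assumptions, not details from the paper.

```python
from dataclasses import dataclass

# Illustrative tier labels loosely following AgentFloor's ladder;
# the exact tier names used by the benchmark are assumptions here.
SHORT_HORIZON_TIERS = {
    "instruction_following",
    "structured_tool_use",
    "multi_step_coordination",
}
LONG_HORIZON_TIERS = {"long_horizon_planning"}

@dataclass
class Task:
    prompt: str
    tier: str  # capability tier this task is expected to need

def pick_model(task: Task) -> str:
    """Route short-horizon, structured work to a cheap open-weight
    model and reserve the frontier model for long-horizon planning."""
    if task.tier in LONG_HORIZON_TIERS:
        return "frontier-model"        # e.g. GPT-5 (placeholder id)
    return "small-open-weight-model"   # e.g. a 7B-32B local model (placeholder id)

if __name__ == "__main__":
    tasks = [
        Task("Call the weather API for Paris", "structured_tool_use"),
        Task("Plan a week-long refactor under budget constraints", "long_horizon_planning"),
    ]
    for t in tasks:
        print(f"{t.tier:>24} -> {pick_model(t)}")
```

A production router would likely add an escalation path (retry with the frontier model when the small model's output fails validation), but the core split mirrors the paper's recommendation: keep the broad base of routine actions cheap, and pay for frontier capability only where planning depth demands it.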