Research & Papers

AgentFloor: How Far Up the Tool-Use Ladder Can Small Open-Weight Models Go?

New 30-task benchmark reveals small open-weight models rival GPT-5 on short-horizon tool use.

Deep Dive

AgentFloor is a new deterministic benchmark designed to answer a practical question: which parts of an agent workflow truly need frontier intelligence? The benchmark spans 30 tasks across a six-tier capability ladder, from instruction following and structured tool use to multi-step coordination and long-horizon planning under persistent constraints. The authors tested 16 open-weight models ranging from 0.27B to 32B parameters alongside GPT-5, running 16,542 scored evaluations.

The results draw a clear boundary: small and mid-sized open-weight models already handle much of the short-horizon, structured tool use that dominates real agent pipelines. In aggregate, the strongest open-weight model matches GPT-5 on the benchmark while being substantially cheaper and faster to run. Frontier models still hold an advantage on long-horizon planning, but neither side reaches high reliability there. Importantly, the boundary isn't explained by scale alone: some failures respond to targeted interventions, but the effects are model-specific. The paper suggests a practical design principle: use smaller models for the broad base of routine actions, and reserve frontier models for the narrower class of tasks demanding deeper planning and control.

Key Points
  • AgentFloor consists of 30 tasks across 6 tiers, from instruction following to long-horizon planning.
  • Small open-weight models (0.27B–32B) achieve parity with GPT-5 on the short-horizon, structured tool use that dominates real agent pipelines.
  • Frontier models still outperform on long-horizon planning requiring persistent constraint tracking, but neither side reaches high reliability.

Why It Matters

Enables cost-effective agent design: route routine tasks to cheap small models, reserve expensive frontier models for complex planning.
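Here is a minimal sketch of that routing principle in Python. The tier labels loosely follow the capability ladder described above, but the specific tier names, model identifiers, and routing rule are illustrative assumptions, not details from the paper.

```python
from dataclasses import dataclass

# Illustrative tier labels loosely following AgentFloor's ladder;
# the exact tier names used by the benchmark are assumptions here.
SHORT_HORIZON_TIERS = {
    "instruction_following",
    "structured_tool_use",
    "multi_step_coordination",
}
LONG_HORIZON_TIERS = {"long_horizon_planning"}

@dataclass
class Task:
    prompt: str
    tier: str  # capability tier this task is expected to need

def pick_model(task: Task) -> str:
    """Route short-horizon, structured work to a cheap open-weight
    model and reserve the frontier model for long-horizon planning."""
    if task.tier in LONG_HORIZON_TIERS:
        return "frontier-model"        # e.g. GPT-5 (placeholder id)
    return "small-open-weight-model"   # e.g. a 7B-32B local model (placeholder id)

if __name__ == "__main__":
    tasks = [
        Task("Call the weather API for Paris", "structured_tool_use"),
        Task("Plan a week-long refactor under budget constraints", "long_horizon_planning"),
    ]
    for t in tasks:
        print(f"{t.tier:>24} -> {pick_model(t)}")
```

A production router would likely add an escalation path (retry with the frontier model when the small model's output fails validation), but the core split mirrors the paper's recommendation: keep the broad base of routine actions cheap, and pay for frontier capability only where planning depth demands it.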