Research & Papers

MAVEN scaffold boosts open-source AI tool-calling accuracy by 23 percentage points

A new lightweight symbolic scaffold lets an open-weight model match proprietary frontier baselines at roughly one-tenth the cost.

Deep Dive

MAVEN addresses a core challenge in agentic AI: generalization across tool-calling environments. Existing large language models often excel on individual benchmarks but fail when asked to compose reasoning strategies, preserve intermediate states, or coordinate multiple tools. MAVEN (Modular Agentic Verification and Execution Network) introduces a lightweight symbolic scaffold that decomposes tasks into structured steps, orchestrates tools adaptively, and performs intermediate verification. This approach requires no additional training, making it a drop-in upgrade for open-weight models like GPT-OSS-120b.
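The decompose, call, verify loop described above can be illustrated with a minimal sketch. This is not MAVEN's actual implementation: the names `Step`, `run_scaffold`, and `TOOLS`, the single `multiply` tool, and the kinetic-energy task are all hypothetical, chosen only to show how a scaffold might preserve intermediate state and check each result before moving on. Adaptive tool selection is left out for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Step:
    """One structured step of a decomposed task (hypothetical structure)."""
    name: str                                                # key under which the result is stored
    tool: str                                                # name of the tool to call
    build_args: Callable[[Dict[str, float]], Tuple[float, ...]]  # derives arguments from earlier results
    verify: Callable[[float], bool]                          # intermediate check on this step's result


def run_scaffold(steps: List[Step], tools: Dict[str, Callable[..., float]]) -> Dict[str, float]:
    """Run steps in order, keeping intermediate state and verifying each result."""
    state: Dict[str, float] = {}
    for step in steps:
        args = step.build_args(state)        # later steps can read earlier results
        result = tools[step.tool](*args)     # tool call
        if not step.verify(result):          # stop as soon as a check fails
            raise ValueError(f"step {step.name!r} failed verification: {result!r}")
        state[step.name] = result
    return state


# Hypothetical tool registry with a single arithmetic tool.
TOOLS: Dict[str, Callable[..., float]] = {"multiply": lambda a, b: a * b}

# Toy task: kinetic energy 0.5 * m * v^2 for m = 2 kg, v = 3 m/s, split into two steps.
plan = [
    Step("v_squared", "multiply", lambda s: (3.0, 3.0), lambda r: r >= 0),
    Step("kinetic_energy", "multiply", lambda s: (0.5 * 2.0, s["v_squared"]), lambda r: r >= 0),
]

state = run_scaffold(plan, TOOLS)
print(state["kinetic_energy"])  # 9.0
```

In this sketch the verifier is an ordinary function rather than a model call, so a failed check costs nothing beyond the tool call itself. A real scaffold would presumably retry or re-plan on failure instead of raising.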

The research team evaluated MAVEN on established benchmarks (BFCL v3, TauBench, Tau2Bench, AceBench) and created MAVEN-Bench, a stress-test benchmark for multi-step mathematical and physical reasoning with adversarial task composition. On MAVEN-Bench, MAVEN lifted its base model, GPT-OSS-120b, from 48% to 71% accuracy, a gain of 23 percentage points. Despite using an open-weight backbone, MAVEN matched or exceeded proprietary frontier baselines at an estimated one-tenth of the cost. These results suggest that lightweight verification scaffolds, rather than brute-force scaling, can enable reliable, cost-effective agentic reasoning in real-world deployments.
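Because the headline figure is easy to misread, the short calculation below separates the percentage-point gain from the relative gain, using only the two accuracy figures reported above. The helper name `accuracy_gain` is an illustrative assumption.

```python
def accuracy_gain(before_pct: float, after_pct: float) -> tuple[float, float]:
    """Return (percentage-point gain, relative gain in percent)."""
    points = after_pct - before_pct          # absolute difference between the two accuracies
    relative = points / before_pct * 100.0   # improvement relative to the starting accuracy
    return points, relative


points, relative = accuracy_gain(48.0, 71.0)
print(f"{points:.0f} percentage points, {relative:.1f}% relative improvement")
# 23 percentage points, 47.9% relative improvement
```

The reported 23 is therefore a difference in percentage points. Relative to the 48% baseline, the improvement is close to 48%.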

Key Points
  • MAVEN uses a symbolic reasoning scaffold for structured decomposition, adaptive tool orchestration, and intermediate verification without additional model training.
  • On the new MAVEN-Bench stress test, MAVEN improved GPT-OSS-120b accuracy from 48% to 71%, a gain of 23 percentage points.
  • MAVEN matched proprietary frontier baselines while operating at roughly 1/10 the cost, making open-source tool-calling more affordable and reliable.

Why It Matters

MAVEN suggests that lightweight verification can make open-source AI cost-effective and reliable for complex, multi-step tool-calling tasks.