Open Source

I benchmarked 17 local LLMs on real MCP tool calling — single-shot AND agentic loop. The difference is massive.

Agentic loops boost performance by 30%+ for models like IBM's Granite-4H-Tiny, revealing a critical gap in single-shot testing.

Deep Dive

A comprehensive independent benchmark evaluated 17 local large language models on real-world MCP (Model Context Protocol) tool calling, moving beyond synthetic function-calling evaluations. The test used a custom Python runner to connect models via LM Studio to a production Workunit project management API exposing 19 real tools. Each model was tested twice: first in a single-shot scenario, where only its first response was scored, and then in an agentic loop, where the model receives tool results and can keep iterating until it succeeds or times out.
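
The post describes the harness only at a high level; the following is a minimal sketch of what such a loop can look like against LM Studio's OpenAI-compatible local server. The base_url, model name, tool schema, and execute_tool callback are placeholders, not the benchmark's actual runner.

```python
# Minimal sketch (not the author's runner) of agentic tool-calling scoring
# against LM Studio's OpenAI-compatible endpoint. All names below are
# illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

def run_task(prompt, tools, execute_tool, model, max_turns=10):
    """Single-shot scoring would inspect only the first assistant message;
    the agentic score lets the model react to tool results up to max_turns."""
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_turns):
        resp = client.chat.completions.create(
            model=model, messages=messages, tools=tools
        )
        msg = resp.choices[0].message
        messages.append(msg.model_dump(exclude_none=True))
        if not msg.tool_calls:
            break  # model gave a final answer instead of another tool call
        for call in msg.tool_calls:
            # Feed each tool result back so the model can correct itself next turn.
            result = execute_tool(call.function.name, json.loads(call.function.arguments))
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": json.dumps(result),
            })
    return messages
```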

The methodology proved decisive. Single-shot testing exposed severe limitations, especially on Level 2 'reasoning' tasks where models had to plan multi-step sequences, while agentic loops unlocked dramatically better performance. IBM's Granite-4H-Tiny 7B model, which is explicitly tool-trained, saw its overall score rise from 73% to 89%. More surprisingly, models not trained for tool calling, such as Baidu's 21B-parameter Ernie-4.5, went from a 0% single-shot success rate to 83% in agentic mode, demonstrating that raw reasoning capability can compensate for a lack of tool-specific fine-tuning when given iterative feedback.

The benchmark spanned 28 tasks at three difficulty levels: explicit instructions (Level 0), natural-language requests (Level 1), and high-level goals requiring planning (Level 2). The hardware was a single NVIDIA RTX 4080 with 16 GB of VRAM and 64 GB of system RAM. The results challenge conventional evaluation methods, suggesting that agentic capability, not just single-response accuracy, is a critical metric for assessing LLMs in practical tool-using applications. This has significant implications for developers choosing models for automation and agent workflows.
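
To make the three levels concrete, here are hypothetical phrasings of one underlying task at each level; the tool names, arguments, and wording are invented for illustration and are not drawn from the Workunit task set.

```python
# Hypothetical task phrasings illustrating the three difficulty levels; the
# benchmark's actual Workunit tasks and tool names are not reproduced here.
LEVEL_EXAMPLES = {
    0: "Call create_task with title='Fix login redirect' and project='Website'.",  # explicit instruction
    1: "Add a task to the Website project for fixing the login redirect.",         # natural-language request
    2: "Login is broken after the redesign; get the team set up to fix it this sprint.",  # goal requiring planning
}
```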

Key Points
  • Agentic testing (iterative feedback) boosted overall scores by an average of over 30% compared to single-shot evaluations, with IBM's Granite-4H-Tiny 7B reaching 89% success.
  • Models not specifically trained for tool calling, like Baidu's Ernie-4.5 21B, achieved 83% success in agentic mode despite 0% single-shot performance, showing reasoning can compensate.
  • The benchmark tested 17 models across 28 real API tasks using Workunit's MCP server with 19 tools, revealing Level 2 'reasoning' tasks as the major differentiator.

Why It Matters

For developers building AI agents, iterative feedback capability is more critical than single-response accuracy for real-world tool automation.