Open Source

GLM 5.1 crushes every other model except Opus in an agentic benchmark at about one-third of the Opus cost

Zhipu AI's new model reaches Opus 4.6 performance for just $0.40 per agent task run.

Deep Dive

Zhipu AI's GLM 5.1 is emerging as a formidable and cost-efficient contender in the AI-agent space. Independent testing on the OpenClaw platform, which evaluates models on real user-submitted tasks in a simulated environment, found that GLM 5.1 performs at the same level as Anthropic's flagship Claude Opus 4.6. The critical differentiator is cost: where an average agentic task run costs about $1.20 with Opus, GLM 5.1 completes it for roughly $0.40, a reduction of about 67%. This pushes the cost-effectiveness frontier for sophisticated AI agents.

The testing methodology, designed to avoid overfitting to static benchmarks, uses 'Chatbot Arena'-style battles in which LLMs judge each other's performance on practical tasks. While GLM 5.1's per-token price is less than one-fifth of Opus's, it uses about twice as many tokens per task because it makes more than twice as many tool calls, acting more aggressively. Far from being a weakness, this highlights its robust agentic capabilities. The model outperformed every other model tested except Opus; Alibaba's Qwen 3.6 also showed promise, with further cost reductions expected once prompt-caching support lands. The results suggest GLM 5.1 is now a top practical choice for agent frameworks like OpenClaw.
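The cost arithmetic above can be sketched in a few lines. This is a back-of-the-envelope model using only the figures quoted in the article; the ratios are illustrative assumptions, not measured pricing data.

```python
# Back-of-the-envelope cost model from the article's figures.
# All constants are illustrative assumptions taken from the text.

OPUS_COST_PER_RUN = 1.20   # avg. cost per agentic task run with Opus (article figure)
GLM_COST_PER_RUN = 0.40    # same task with GLM 5.1 (article figure)

# GLM's per-token price is under 1/5 of Opus's, but it uses about 2x the
# tokens per task, so the expected cost ratio is bounded by 2/5 of Opus.
price_ratio = 1 / 5        # upper bound on GLM per-token price vs. Opus
token_ratio = 2            # GLM uses roughly twice as many tokens per task

expected_cost_ratio = price_ratio * token_ratio              # <= 0.40
observed_cost_ratio = GLM_COST_PER_RUN / OPUS_COST_PER_RUN   # ~0.33
savings = 1 - observed_cost_ratio                            # ~0.67

print(f"expected cost ratio <= {expected_cost_ratio:.2f}")
print(f"observed cost ratio:  {observed_cost_ratio:.2f}")
print(f"savings: {savings:.0%}")
```

The two ratios roughly agree: a sub-one-fifth token price times double the token usage lands near the observed one-third cost per run.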

Key Points
  • GLM 5.1 matches Claude Opus 4.6 performance in real-world agentic benchmarks using the OpenClaw testing platform.
  • It achieves this at about one-third the cost per task run (~$0.40 vs Opus's ~$1.20), despite using more tokens due to aggressive tool calling.
  • The testing avoids static benchmarks, using user-submitted tasks and LLM-as-judge battles to evaluate practical, agentic usefulness.
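The 'Arena'-style evaluation described above can be sketched as a pairwise rating loop. Everything here is a minimal illustration, not OpenClaw's actual code: the model names, the `update_elo` helper, and the Elo constants are all assumptions, and the judge's verdicts are hard-coded stand-ins for real LLM-as-judge outputs.

```python
# Minimal sketch of an Arena-style pairwise evaluation using Elo ratings.
# Names, constants, and the hard-coded verdicts are illustrative only.

def expected_score(r_a: float, r_b: float) -> float:
    """Expected win probability of A under the standard Elo model."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def update_elo(ratings: dict, a: str, b: str, winner: str, k: float = 32) -> None:
    """Update both models' ratings after one judged battle."""
    ea = expected_score(ratings[a], ratings[b])
    score_a = 1.0 if winner == a else 0.0
    ratings[a] += k * (score_a - ea)
    ratings[b] += k * ((1.0 - score_a) - (1 - ea))

# Each entry stands in for a judge's verdict on one user-submitted task.
ratings = {"glm-5.1": 1000.0, "opus-4.6": 1000.0}
for winner in ["opus-4.6", "glm-5.1", "glm-5.1"]:
    update_elo(ratings, "glm-5.1", "opus-4.6", winner)

print(ratings)
```

Because ratings emerge from head-to-head verdicts on fresh tasks rather than fixed answer keys, a model can't game the leaderboard by memorizing a static test set, which is the point of the methodology the article describes.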

Why It Matters

This drastically lowers the cost of building and running sophisticated AI agents, making advanced automation accessible to more developers and companies.