GLM 5.1 sits alongside frontier models in a community social reasoning benchmark
New benchmark shows GLM 5.1 competes with Claude Opus 4.6 in social deduction games for just $0.92 per game.
Zhipu AI's latest large language model, GLM 5.1, is showing impressive capabilities in a unique benchmark designed to test advanced social reasoning. The benchmark, created by a community researcher, pits AI models against each other in autonomous games of 'Blood on the Clocktower,' a complex social deduction game requiring strategic deception, alliance-building, and long-term planning. Early results indicate GLM 5.1's performance is competitive with established frontier models like Anthropic's Claude Opus 4.6, a notable achievement for the Chinese AI firm.
Beyond raw performance, the cost efficiency is striking. While Claude Opus 4.6 costs approximately $3.69 to run a single game simulation, GLM 5.1 accomplishes the same task for just $0.92—a 75% reduction. Furthermore, GLM 5.1 reportedly achieved a 0% tool error rate in this test, indicating reliable execution of the multi-step actions required to play the game. This combination of high competency in a nuanced task and dramatically lower cost presents a compelling value proposition.
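The quoted 75% figure follows directly from the two per-game costs. A quick check of the arithmetic (figures taken from this article):

```python
# Per-game costs as reported above.
opus_cost = 3.69  # USD per game, Claude Opus 4.6
glm_cost = 0.92   # USD per game, GLM 5.1

reduction = (opus_cost - glm_cost) / opus_cost
print(f"Cost reduction: {reduction:.0%}")  # → 75%
```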
This benchmark moves beyond standard multiple-choice tests to evaluate how models handle dynamic, interactive scenarios with imperfect information. Success requires understanding character roles, fabricating consistent narratives, and adapting strategies based on other players' actions—skills directly applicable to real-world AI agent coordination and complex customer service simulations. The results suggest the performance gap between Western and Chinese frontier models may be narrowing in specific, sophisticated domains.
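Mechanically, this kind of benchmark reduces to a turn loop that feeds each model the game state, parses the action it emits, and records any malformed or illegal moves. The sketch below is purely illustrative — the names (`PlayerStats`, `record_action`) and structure are assumptions, not the benchmark's actual harness — but it shows what a "tool error rate" plausibly measures:

```python
from dataclasses import dataclass

# Hypothetical bookkeeping for one model over one game;
# not taken from the benchmark's real code.
@dataclass
class PlayerStats:
    actions: int = 0
    tool_errors: int = 0  # malformed or illegal actions

    @property
    def tool_error_rate(self) -> float:
        return self.tool_errors / self.actions if self.actions else 0.0

def record_action(stats: PlayerStats, action: str, legal_actions: set[str]) -> bool:
    """Count one model-issued action; flag it if it is not a legal move."""
    stats.actions += 1
    if action not in legal_actions:
        stats.tool_errors += 1
        return False
    return True
```

In these terms, a 0% tool error rate means every action the model emitted across an entire game parsed correctly and matched a legal move.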
- GLM 5.1 costs $0.92 per benchmark game, 75% cheaper than Claude Opus 4.6 at $3.69.
- Achieved a 0% tool error rate in a complex social deduction game requiring multi-step actions.
- Benchmark uses 'Blood on the Clocktower' to test strategic deception and long-term planning beyond standard QA.
Why It Matters
Demonstrates a high-performance, cost-effective alternative for running complex multi-agent simulations and interactive AI applications.